arXiv Papers on Object Segmentation

Abstract:
Open‑vocabulary semantic segmentation (OVSS) in remote sensing images aims to segment categories beyond a fixed label space. Recent SAM 3‑based methods provide a promising training‑free foundation, yet three key issues remain: (1) a single class‑name prompt lacks sufficient semantic coverage for complex remote sensing categories; (2) expanding each category into multiple prompts introduces redundant online text encoding; and (3) directly aggregating multiple prompt responses propagates noisy activations into the final prediction. To address these issues, we propose ProC‑SAM3, which calibrates SAM 3's prompt interface for remote sensing OVSS from three complementary aspects. First, we construct an offline prompt pool where a Category Matcher groups MLLM‑generated candidates into per‑category sets, and Expansion Constraints further refine each set using category‑specific prior knowledge. Second, the resulting text embeddings are cached and reused across all test images, eliminating repeated text encoding. Third, we introduce Presence‑Guided Residual Fusion to gate unreliable decoder outputs by prompt presence and confidence, followed by peak‑preserving class aggregation that retains fine‑grained activations for small and sparse objects. Experiments on eight benchmarks show that ProC‑SAM3 achieves an average mIoU of 56.1%, outperforming the previous best training‑free method by 3.9 percentage points. Code will be available at https://github.com/YanghuiSong/ProC‑SAM3.

Abstract:
In this report, we present our submission to the GOOSE 2D Fine‑Grained Semantic Segmentation Challenge, organized as part of the Workshop on Field Robotics at ICRA 2026. The challenge combines data from the GOOSE and GOOSE‑Ex datasets, which comprise more than 13k images captured from 4 distinct camera setups, annotated using a hierarchical taxonomy of 56 fine‑grained classes and 11 broader categories. Starting from SegFormer as a baseline, we progressively improve segmentation performance through increased training crop sizes, a transition to the query‑based Mask2Former architecture, and test‑time augmentation. Our experiments show that query‑based segmentation significantly outperforms the baseline model. Furthermore, increasing the crop size used during training yields substantial gains, highlighting the relevance of preserving scene context for fine‑grained semantic disambiguation. Our final submission, using test‑time augmentation, achieves an mIoU of 69.6% on the challenge test set, providing a strong baseline for fine‑grained semantic segmentation in outdoor environments. To facilitate reproducibility and future research, code and weights will be made publicly available at https://github.com/RoboticsLabURJC/outdoor‑fine‑grained‑segmentation.

Abstract:
In multi‑object text‑to‑image (T2I) diffusion, ensuring semantic consistency between textual prompts and generated visual content is crucial for image synthesis. However, this consistency constraint is often underemphasized in the denoising process of diffusion models. Although token‑supervised diffusion models can mitigate this issue by learning object‑wise consistency between the image content and object segmentation maps, they tend to suffer from segmentation map bias and semantic overlap conflict, especially when multiple objects are involved. In this paper, we propose ELDiff, a new evidential learning‑supervised T2I diffusion model, which leverages the advantages of uncertainty metric and conflict detection to enhance the fault tolerance of unreliable segmentation maps and suppress semantic conflicts, strengthening object‑wise consistency learning. Specifically, a pixel evidence loss is proposed to restrain overconfidence in unreliable labels through evidential regularization, and a token conflict loss is designed to weaken the contradiction between semantics by optimizing a measured conflict factor. Extensive experiments show that ELDiff outperforms existing training‑based and training‑free T2I diffusion models on SD v1.4, SD v2.1, SDXL, SD v3.5, and Qwen‑Image, without requiring additional inference‑time manipulations. Notably, ELDiff can be seamlessly extended to the existing training pipeline of T2I diffusion models. Code can be found at https://github.com/QingtaoPan/ELDiff.

Abstract:
Existing weakly supervised semantic segmentation (WSSS) methods in computational pathology rely on a multi‑stage paradigm: class activation map (CAM) generation, offline pseudo‑mask refinement, and fully supervised retraining. While established, this decoupled approach presents fundamental limitations. The multi‑stage process not only incurs high computational training costs but also suffers from error propagation: local texture biases in shallow CNN layers generate false‑positive artifacts that subsequent refinement steps often fail to correct. To address these persistent challenges through a simple yet highly effective approach, we propose the Single‑Stage Hierarchical Rectification (SSHR) framework. Rather than passively refining CAMs post‑hoc, our method proactively purifies intermediate feature representations during the forward pass. We introduce a Hierarchical Feature Rectification Module (HFRM) that utilizes deep global semantic context to filter out local anomalies in shallow layers. This mechanism generates high‑fidelity activation maps directly within a single training loop. Experiments on the LUAD‑HistoSeg and BCSS datasets demonstrate that SSHR outperforms state‑of‑the‑art multi‑stage methods. Furthermore, SSHR reduces training duration by 2 to 5 times. This efficiency minimizes computational overhead and accelerates clinical translation for large‑scale histopathology workflows. The code is available at: https://github.com/trongduc‑nguyen/SSHR

Abstract:
Vision Foundation Models (VFMs) with Vision Transformer (ViT) backbones, such as DINOv2, have become essential for downstream tasks like object recognition and semantic segmentation. The immense computational requirements of these backbones often necessitate distillation into smaller architectures for edge deployment. Feature‑based knowledge distillation (KD) often suffers from the teacher‑student gap: the student struggles to imitate the teacher's complex feature maps due to its limited capacity. To mitigate this bottleneck, we propose LEAP: Layer‑skipping Efficiency via Adaptive Progression, a training curriculum for ViT feature‑based knowledge distillation. By utilizing the teacher's intermediate feature maps as a sequence of progressively more difficult targets, our curriculum allows the student to build a foundational representation before tackling higher‑level abstractions. Our results demonstrate that this paradigm significantly accelerates convergence through adaptive difficulty selection across various student model sizes and dataset scales. With our curriculum, the LEAP‑distilled ViT‑S achieves 90.1% accuracy on ImageNet‑100, a +12.24% improvement compared with the baseline. On ImageNet‑1K, LEAP achieves +3.84% and +7.75% improvement for the instance retrieval task on the Oxford and Paris datasets, respectively. Furthermore, the curriculum enables 25.1% savings in training FLOPs and 21% savings in training time on ImageNet‑100 by implementing early‑stopping for teacher inference during the initial stages of training. Code is available at https://github.com/KevinZ0217/LEAP

Abstract:
We present GOOSE‑M2F, a task‑specific adaptation of Mask2Former for the GOOSE 2D Fine‑Grained Semantic Segmentation (FGSS) Challenge at ICRA 2026. The GOOSE benchmark spans 64 fine‑grained classes across unstructured outdoor terrain with a severely long‑tailed distribution, where rare classes occupy fewer than 50 pixels per image. We extend the Swin‑Large Mask2Former baseline with three targeted contributions: (1) 200 object queries to eliminate representational saturation; (2) a Feature Refinement Module (FRM) combining ASPP‑lite and CBAM dual‑attention; and (3) an Auxiliary Supervision Head that delivers direct per‑pixel gradients for rare classes. A multi‑stage training strategy pairs Distribution‑Balanced loss, Rare‑Class Copy‑Paste augmentation, dynamic IoU‑aware re‑weighting, and EMA. At inference, a dense sliding‑window engine with 2D Gaussian kernel blending and 4‑scale TTA adds +10.57%. GOOSE‑M2F achieves 70.08% Official Composite mIoU (63.55% fine, 76.61% coarse), placing 3rd on the GOOSE 2D FGSS leaderboard. Code and trained models are publicly available at GitHub: https://github.com/Aditya‑Lingam‑9000/GOOSE‑M2F and Hugging Face: https://huggingface.co/XYZ9843/GOOSE‑M2F.
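Sliding-window inference with 2D Gaussian kernel blending, as mentioned above, can be sketched in a few lines. This is a generic illustration, not the GOOSE‑M2F engine: the function names, the `predict` callback, and the choice `sigma = window / 4` are assumptions, and the sketch requires `stride <= window` and an image at least as large as the window so that every pixel is covered.

```python
import math


def gaussian_kernel_2d(size, sigma):
    """Return a size x size weight window that peaks at the window centre."""
    c = (size - 1) / 2.0
    return [
        [math.exp(-((y - c) ** 2 + (x - c) ** 2) / (2.0 * sigma ** 2))
         for x in range(size)]
        for y in range(size)
    ]


def window_starts(length, window, stride):
    """Top/left offsets so that the last window ends exactly at the border."""
    starts = list(range(0, length - window + 1, stride))
    if starts[-1] != length - window:
        starts.append(length - window)
    return starts


def blend_windows(height, width, window, stride, predict):
    """Blend overlapping window predictions with Gaussian weights.

    predict(top, left) is a hypothetical callback returning a
    window x window grid of scores for the crop at that offset.
    """
    kernel = gaussian_kernel_2d(window, sigma=window / 4.0)  # assumed sigma
    acc = [[0.0] * width for _ in range(height)]
    norm = [[0.0] * width for _ in range(height)]
    for top in window_starts(height, window, stride):
        for left in window_starts(width, window, stride):
            scores = predict(top, left)
            for dy in range(window):
                for dx in range(window):
                    w = kernel[dy][dx]
                    acc[top + dy][left + dx] += w * scores[dy][dx]
                    norm[top + dy][left + dx] += w
    # Weighted average: pixels near a window centre count more than borders.
    return [[acc[y][x] / norm[y][x] for x in range(width)]
            for y in range(height)]


blended = blend_windows(
    6, 7, window=4, stride=2,
    predict=lambda top, left: [[1.0] * 4 for _ in range(4)],
)
```

Down-weighting window borders in this way reduces visible seams where neighbouring crops disagree.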

Abstract:
Images can be segmented based on visual cues (i.e., texture segmentation) or into objects (i.e., semantic segmentation). We propose a new category of sub‑semantic image segmentation that blurs the line between the two. In sub‑semantic image segmentation, language is not used to name whole objects. Instead, it is used to partition an image into stable appearance patterns that can be described by language. To do that, we couple a general‑purpose vision‑language model to SAM 3, a promptable segmentation backbone whose native text pathway can ground rich descriptions into masks. Simple coupling fails for a number of reasons that we identify in the paper, and we overcome them by introducing DETECTURE, which resolves three concrete failure modes: language leakage between texture regions, prompt competition inside the segmentation backbone, and semantic distortion at the language‑to‑mask interface. Since there is no dataset of sub‑semantic image segmentation, we introduce one, termed TextureADE. The new dataset is derived from the ADE20K dataset using a system we designed. We compare DETECTURE to a number of baselines and find that it achieves the strongest performance on several datasets using different metrics. Code is available at https://github.com/Scientific‑Computing‑Lab/TextureDetecture.

Abstract:
Vision Transformers (ViTs) have become a dominant architecture for visual representation learning, providing exceptionally strong and broadly reusable backbone features. However, ViTs are commonly operated on relatively small patch‑token grids due to the quadratic cost of global self‑attention, which creates a persistent bottleneck for dense prediction tasks such as semantic segmentation and depth estimation. This has motivated the development of task‑agnostic feature upsamplers. While recent state‑of‑the‑art methods produce visually sharp dense representations, their reliance on shallow image encoders for guided upsampling can introduce feature leakage, fragmentation, and blur. We introduce ViT‑Up, an implicit feature upsampling framework that replaces external image guidance with layer‑wise query construction from intermediate ViT hidden states. This enables feature prediction at arbitrary continuous image coordinates while preserving alignment with the backbone feature space. Experiments demonstrate that ViT‑Up consistently outperforms state‑of‑the‑art image‑guided upsamplers across dense prediction and semantic correspondence. On DINOv3‑S+, ViT‑Up improves over prior methods by up to +2.07 mIoU on Cityscapes and +4.17 PCK@0.10 on SPair‑71k. With the larger DINOv3‑B backbone, these gains increase to +3.36 mIoU and +8.09 PCK@0.10, demonstrating that ViT‑Up scales favorably with backbone capacity.

Abstract:
Object detection and instance segmentation tasks are closely related. Existing top‑down instance segmentation methods usually follow a detect‑then‑segment paradigm, where an initial detector is used to recognize and localize objects with bounding boxes, followed by the segmentation of an instance mask within each bounding box. In such methods, the detection accuracy directly influences the subsequent segmentation performance. However, previous research has seldom explored the impact of the instance segmentation task on object detection. In this paper, we present a turbo‑inference strategy for top‑down methods that iteratively leverages the complementary information between detection and segmentation tasks. Specifically, we design two modules, a turbo‑detection head and a turbo‑segmentation head, which facilitate communication between the tasks. The two modules form a closed loop that interlaces the detection and segmentation results without retraining the model. Comprehensive experiments on the COCO, iFLYTEK, and Cityscapes datasets demonstrate that our method substantially enhances both detection and segmentation accuracy at the cost of some additional computation. The proposed method represents a tradeoff between prediction accuracy and inference speed. Code is available at https://github.com/zhaozhen2333/Turbo‑Learning.git.

Abstract:
We introduce WorldOlympiad, a benchmark for diagnosing video‑based world models across physical faithfulness, geometric consistency, and interaction fidelity. While existing benchmarks often focus on visual quality, semantic alignment, or short‑term temporal coherence, they provide limited insight into whether generated videos obey physical rules, preserve coherent 3D structure, and sustain controllable interactions over long horizons. To address this gap, WorldOlympiad decomposes world‑model evaluation into three complementary dimensions. The physical track uses object segmentation and MLLM‑as‑judge to assess whether generated videos follow interpretable rules in mechanics, thermal phenomena, and material properties. The geometry track reconstructs generated videos with Gaussian splatting and evaluates structural consistency, cross‑view coherence, and camera‑trajectory alignment. The interaction track assesses whether generated rollouts follow complex action prompts and maintain smooth, coherent transitions across consecutive video chunks. WorldOlympiad further covers three major downstream scenarios, including gaming, robotics, and general real‑world videos, capturing diverse challenges from interactive control and embodied manipulation to open‑domain motion and camera dynamics. Together, these tracks and scenarios form a scalable and interpretable evaluation suite that exposes failure modes beyond generic video quality. Experiments on state‑of‑the‑art models reveal substantial gaps in physical reasoning, 3D consistency, and long‑horizon interaction, underscoring the need for more structured evaluation protocols for generative world models.

Abstract:
Precise rover localization is a prerequisite for autonomous lunar exploration, yet the absence of Global Navigation Satellite System (GNSS) signals and the cumulative drift of local localization methods severely constrain long‑range missions. Cross‑view localization provides a promising drift‑free global solution by matching rover‑view and satellite‑view imagery. However, the lunar environment poses unique challenges for correspondence alignment, including inter‑entity entanglement, inter‑viewpoint divergence, and simulation‑to‑real domain shift. To address these challenges, we propose Warped Alignment of Reprojected Graphs (WARG), a framework that leverages unified graph learning and reprojected graph matching for robust cross‑view alignment. Pretrained on the synthetic LuSNAR dataset, WARG achieves an average test error of 0.32 m and demonstrates robust zero‑shot generalization to the synthetic lunar south pole region with an error of 3.63 m. More importantly, when validated on real‑world data from the YuTu‑2 rover, WARG achieves a localization error of 1.68 m within a 100 m x 100 m search area, corresponding to nearly one‑pixel precision in low‑resolution satellite imagery with a spatial resolution of 1.40 m/pixel. Beyond accuracy, WARG is computationally efficient, containing only 1.56M parameters, corresponding to 16.12% of previous lightweight models, and operating at 5.49 Hz on an NVIDIA RTX A6000 GPU, approaching GNSS‑level update frequency. Finally, we observe that WARG naturally develops low‑level spatial awareness, including semantic segmentation and structural reasoning, through cross‑view localization learning, highlighting its potential as a promising paradigm for spatial intelligence with minimal annotation cost. The source code is available at https://github.com/maochen‑casia/warg.

Abstract:
Reasoning Video Object Segmentation (RVOS) demands a sophisticated integration of temporal dynamics, spatial details, and linguistic reasoning to achieve precise pixel‑level localization. Existing methods are limited to reasoning over fixed initial inputs and lack the capacity to actively acquire further visual evidence, which is often essential for resolving complex references in long or intricate videos. To address this, we propose VideoSEG‑O3, the first multi‑turn reinforcement learning framework for RVOS that emulates the human "coarse‑to‑fine" cognitive process. It employs a multi‑turn temporal‑spatial chain‑of‑thought to capture fine‑grained details by iteratively pinpointing critical intervals and keyframes. Additionally, to enable the policy to perceive segmentation quality beyond the mere text probability of the [SEG] token during the RL stage, we introduce SEG‑aware logit calibration, which integrates pixel‑wise segmentation feedback directly into the token‑level logits. Furthermore, we design a decoupled thinking trace to hierarchically decompose the reasoning process into temporal, spatial, and linguistic dimensions, and construct VTS‑CoT, a specialized cold‑start dataset featuring comprehensive reasoning trajectories. The code and models will be released at https://github.com/Dmmm1997/VideoSEG‑O3.

Abstract:
Semantic segmentation of remote sensing imagery requires models that capture both global context and local detail under tight computational budgets. Prior work typically optimizes for one of these axes: attention for global context, convolution for local detail, or compactness for efficiency. While hybrid approaches aim to capture both, they require architectural changes and encoder backbones with computational overhead, limiting efficiency and performance. We present LALE (Lightweight‑transformer Architecture for Land‑cover Estimation), an end‑to‑end remote sensing image segmentation architecture that bifurcates its encoder by resolution: lightweight ConvMixer stages handle high‑resolution local features, while transformer stages handle low‑resolution global context, confining the quadratic cost of self‑attention to deep, downsampled feature maps. An all‑MLP multi‑scale decoder, together with RMSNorm and StarReLU throughout, further reduces compute and parameter count. On the large‑scale ARAS400k remote‑sensing segmentation benchmark, LALE establishes a strong efficiency‑performance trade‑off against CNN, transformer, and hybrid baselines. Our smallest variant (just 1.6M parameters) reaches within 2.6 F1 points of the best baseline (UPerNet) while using 4.5x fewer parameters, 7x less storage, 17x fewer GMACs, and delivering 1.8x higher throughput. The codebase for LALE is publicly available at https://github.com/caglarmert/LALE.
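A back-of-the-envelope calculation shows why confining self-attention to downsampled feature maps matters. The input size (512 x 512) and the strides (4 and 16) below are illustrative assumptions, not values taken from the abstract.

```python
def attention_pairs(height, width, stride):
    """Token count and number of query-key pairs for global self-attention."""
    tokens = (height // stride) * (width // stride)
    return tokens, tokens * tokens


hi_tokens, hi_pairs = attention_pairs(512, 512, stride=4)    # shallow stage
lo_tokens, lo_pairs = attention_pairs(512, 512, stride=16)   # deep stage

# 16384 tokens vs 1024 tokens: 16x fewer tokens, 256x fewer attention pairs.
print(hi_tokens, lo_tokens, hi_pairs // lo_pairs)
```

Because the pairwise term grows with the square of the token count, each extra 2x downsampling step cuts it by a factor of 16.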

Abstract:
Reliable semantic segmentation for mobile robots requires both accurate dense prediction and robust uncertainty estimation under distribution shift. Strong uncertainty baselines such as Monte Carlo Dropout often require repeated stochastic forward passes and are difficult to deploy on edge platforms. We propose Energy‑Aware NECO, a single‑pass pixel‑wise out‑of‑distribution (OOD) detector for semantic segmentation. The method combines a centered NECO‑style geometric ratio computed from decoder features with a logit‑based Energy score. Both components are standardized using statistics fitted on a pure in‑distribution validation split and fused through a convex combination. We evaluate the method on the miniMUAD subset using true pixel‑level OOD labels. The proposed hybrid score achieves an AUROC of 0.8539, outperforming NECO‑only (0.8280), Energy‑only (0.8171), and an ensemble predictive‑entropy baseline (0.8124). Additional qualitative and operating‑point analyses show that the hybrid detector improves overall ranking performance while preserving the efficiency advantages of a single‑pass design. Code is available at https://github.com/boyuan‑zhangx/Energy‑Aware_NECO
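The fusion of a standardized geometric score with a standardized Energy score can be sketched as follows. This is an illustration under assumptions, not the released implementation: the geometric (NECO-style) score is taken as a precomputed input, both scores are assumed to be oriented so that larger means more OOD-like, and the names `fit_standardizer`, `fuse`, and `alpha` are hypothetical.

```python
import math
from statistics import mean, pstdev


def energy_score(logits, temperature=1.0):
    """Free-energy score -T * logsumexp(logits / T) for one pixel.

    Confident predictions give a lower (more negative) energy.
    """
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(s - m) for s in scaled))
    return -temperature * lse


def fit_standardizer(in_distribution_values):
    """Mean and standard deviation fitted on an in-distribution split."""
    mu = mean(in_distribution_values)
    sigma = pstdev(in_distribution_values) or 1.0  # guard against zero spread
    return mu, sigma


def fuse(geo, energy, geo_stats, energy_stats, alpha=0.5):
    """Convex combination of the two z-scored components (0 <= alpha <= 1)."""
    z_geo = (geo - geo_stats[0]) / geo_stats[1]
    z_energy = (energy - energy_stats[0]) / energy_stats[1]
    return alpha * z_geo + (1.0 - alpha) * z_energy


flat = energy_score([0.0, 0.0])        # uncertain pixel
confident = energy_score([10.0, 0.0])  # confident pixel
```

Standardizing both components on the same in-distribution split puts them on a comparable scale before mixing, so a single `alpha` controls their relative influence.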

Abstract:
Cell instance segmentation models trained on cell‑specific datasets suffer severe performance drops on out‑of‑distribution cell types, while interactive foundation models overcome this through per‑instance prompting at a cost that is prohibitively expensive for histopathology images containing hundreds to thousands of densely packed instances. We introduce Group Prompting, a new paradigm that shifts interactive segmentation from per‑instance O(N) to per‑type O(T), where a single click per cell type suffices to segment all instances of that type. Our key observation is that the frozen image encoder of the Segment Anything Model (SAM) already clusters same‑type cells in its feature space before any prompt is given. Exploiting this property, we propose Chain‑of‑Prompts (CoP), a training‑free framework that recursively expands a single user click by (1) identifying reliable same‑type locations through non‑parametric gating of multi‑scale encoder features, and (2) selecting the most spatially distant reliable point as the next prompt to maximize coverage. On three cell‑type‑annotated benchmarks, CoP with one click per type retains over 90% of per‑instance performance and surpasses fully‑supervised methods without any additional training. On four morphologically homogeneous benchmarks, a single click retains over 99%. Project Page: https://shjo‑april.github.io/Chain‑of‑Prompts/
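Step (2), choosing the most spatially distant reliable point as the next prompt, resembles farthest-point selection. The sketch below is one possible reading, not the authors' code: it assumes "most distant" means maximizing the minimum Euclidean distance to all prompts chosen so far, and it keeps the set of reliable points fixed, whereas the described framework re-identifies reliable locations as the chain expands.

```python
import math


def next_prompt(prompts, reliable_points):
    """Pick the reliable point whose nearest existing prompt is farthest away."""
    def distance_to_nearest_prompt(point):
        return min(math.dist(point, p) for p in prompts)
    return max(reliable_points, key=distance_to_nearest_prompt)


def chain_prompts(first_click, reliable_points, steps):
    """Expand one user click into a chain of prompts (hypothetical helper)."""
    prompts = [first_click]
    remaining = [p for p in reliable_points if p != first_click]
    for _ in range(steps):
        if not remaining:
            break
        chosen = next_prompt(prompts, remaining)
        prompts.append(chosen)
        remaining.remove(chosen)
    return prompts


chain = chain_prompts((0, 0), [(1, 0), (10, 0), (5, 0)], steps=2)
```

In this toy example the chain first jumps to the far end of the row and then fills the gap in the middle, which is the coverage-maximizing behaviour the abstract describes.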

Abstract:
Instance‑level quantification of kidney functional units is essential for morphometric analysis, yet most publicly available pathology datasets provide only semantic segmentation annotations, where adjacent structures of the same class are merged into single regions. This prevents reliable instance‑level analysis and limits downstream quantitative studies. Existing heuristic post‑processing methods often yield suboptimal instance separation, particularly in crowded and adherent regions, while deep learning‑based instance segmentation approaches typically require intensive instance‑level annotations that are costly and labor‑intensive to obtain. We propose MORI‑Seg, a deep learning framework that enables instance segmentation without requiring instance‑level annotations. Instead of heuristic splitting or instance supervision, MORI‑Seg learns morphology‑aware geometric representations directly from semantic masks by jointly modeling object‑centric distance fields and boundary‑band representations to encode interior structure and contact interfaces. A class‑conditioned feature disentanglement module further promotes intra‑instance coherence and inter‑instance separation. Under semantic‑only supervision, MORI‑Seg decomposes connected semantic regions into distinct instance masks in an end‑to‑end manner. Experiments demonstrate improved instance separation accuracy and more reliable morphometric quantification compared with classical post‑processing pipelines and representative semantic‑to‑instance learning approaches. The official implementation is publicly available at https://github.com/ddrrnn123/MORI‑Seg.

Abstract:
We address the challenging task of 3D object segmentation in complex scene point clouds without relying on any scene‑level human annotations during training. Existing methods are typically constrained to identifying simple objects, primarily due to insufficient object priors in the learning process. In this paper, we present FoundObj, a novel framework featuring a superpoint‑based object discovery agent that incrementally merges suitable neighboring superpoints, guided by our innovative semantic and geometric reward modules. These modules synergistically leverage semantic and geometric priors from self‑supervised 2D/3D foundation models, providing complementary feedback to the object discovery agent and enabling robust identification of multi‑class objects through reinforcement learning. Extensive experiments on diverse benchmarks demonstrate that our approach consistently outperforms existing baselines. Notably, our method exhibits strong generalization in zero‑shot and long‑tail scenarios, underscoring its potential for scalable, label‑free 3D object segmentation.

Abstract:
Semantic segmentation in complex environments such as urban driving scenes remains challenging under adverse lighting conditions, where RGB images alone provide insufficient information. RGB‑Thermal fusion leverages the complementary strengths of visible and infrared imagery to improve scene understanding; however, effectively integrating these heterogeneous modalities at varying levels of feature abstraction remains an open problem. In this paper, we propose a multi‑modal fusion architecture built upon dual ConvNeXt V2 backbones that employs stage‑wise, modality‑adaptive fusion strategies. For early‑stage features, we introduce a Frequency‑Based Fusion Module that decomposes infrared features into low‑ and high‑frequency components via Gaussian filtering, applies dual‑branch spatial attention to selectively emphasize thermal patterns and fine‑grained boundaries, and integrates them with RGB features through a confidence‑gated residual mechanism. For late‑stage features, we design a semantic fusion module with cross‑modal attention and multi‑scale depthwise convolutions to capture semantic correspondences across modalities. The fused features are decoded via a PANet‑style bidirectional decoder with deep supervision. Experiments on MFNet and PST900 demonstrate that our lightest variant achieves 61.73% and 86.24% mIoU, respectively, with only 35.43M parameters, outperforming recent methods while using substantially fewer parameters and lower computational cost. Code is available at https://github.com/ismailemrecntz/VISIBLE‑INFRARED‑SENSOR‑FUSION

Abstract:
Ultrasound video segmentation is clinically valuable yet difficult due to speckle noise, weak boundaries, and rapid anatomical deformation. Recent promptable foundation models enable point‑guided segmentation, but their direct deployment in ultrasound remains unreliable: a single point provides insufficient spatial context to resolve scale ambiguity, and greedy memory updates amplify early errors into severe temporal drift. We present EchoPilot, a training‑free framework for ultrasound video segmentation under sparse first‑frame interaction, requiring only a single point click and an anatomical category name. EchoPilot orchestrates a frozen medical vision‑language model (VLM) for semantic localization, a vision foundation model (VFM) for dense geometric feature extraction, and a promptable video segmentor for mask prediction and propagation. To resolve initialization ambiguity, we propose Scale‑Space Semantic Prompting, which first selects an optimal contextual view via a parameter‑free S.E.E.D. (Semantic Energy‑Entropy Density) criterion, and then synthesizes geometrically precise auxiliary point prompts from dense foundation features without additional user interaction. To reduce propagation drift, a Reliability‑Gated Memory update is further introduced to selectively freeze the segmentor's memory bank under uncertain predictions, preventing error accumulation. We also contribute the first dynamic fetal placenta ultrasound video segmentation dataset with 671 annotated frames. Across three ultrasound video datasets, EchoPilot achieves state‑of‑the‑art performance under the sparse‑interactive setting, consistently outperforming training‑free baselines and finetuned specialists.

Abstract:
This report presents our solution for the WeatherProof Dataset Challenge, namely CVPR 2026 8th UG2+ Challenge Track 2: Semantic Segmentation in Adverse Weather. For the semantic segmentation task under adverse weather conditions, we propose a semi‑supervised segmentation pipeline. Our method is trained exclusively on the WeatherProof dataset, without using any additional external data. Specifically, we adopt UniMatch V2 as the baseline model and treat all degraded‑weather images as unlabeled data for semi‑supervised training, thereby fully exploiting the data distribution provided by the challenge. During inference, we further apply test‑time augmentation to improve the robustness and segmentation accuracy of the final predictions. The code is publicly available at: https://github.com/ylb888/weatherproof‑challenge‑unimatchv2.

Abstract:
Cooperative perception enabled by Vehicle‑to‑Everything (V2X) communication enhances autonomous driving safety by creating a unified environmental representation through shared sensory data. While recent works have advanced multi‑agent fusion for improved perception, uncertainty quantification in such cooperative frameworks remains largely unexplored. This paper introduces Hyper‑V2X, a hypernetwork‑based framework for estimating both epistemic and aleatoric uncertainties in V2X‑based perception. Specifically, we propose a partial weight generation scheme and a V2X context embedding module that conditions a Bayesian hypernetwork on fused multi‑agent features to generate weight distributions for stochastic Bird's‑Eye‑View (BEV) segmentation. Unlike existing deterministic BEV models, Hyper‑V2X enables efficient uncertainty estimation with little computational overhead. Our approach is architecture‑agnostic and can be seamlessly integrated with modern cooperative backbones such as CoBEVT. Experiments on the OPV2V benchmark demonstrate that Hyper‑V2X provides accurate, well‑calibrated uncertainty estimates and improves overall perception reliability. Our code and benchmark are publicly available under an open‑source license: https://github.com/abhishekjagtap1/Hyper‑V2X

Abstract:
Subterranean (SubT) environments have been a frontier for autonomous robotics, driven by the push for automation of mining operations and the interest in planetary exploration (Martian lava tubes). Due to the challenges involved in accessing real SubT environments, rigorous hardening of autonomy stacks in realistic simulation environments is critical. This article fills a well‑known gap: the lack of a large‑scale simulation‑based benchmarking infrastructure for rigorous statistical evaluation of robotic autonomy, which is why SubT research articles commonly present validation results in a few environments at best. This article presents SubTGraph, a novel framework for rapid synthesis of multi‑level SubT environments with high variability, incorporating user specifications related to topology, dimensionality, textures, etc., to generate distinct environments such as operational mines, natural caves, and lava tubes. SubTGraph builds a cost matrix from user‑specified structural constraints to guide the classical Dijkstra algorithm to procedurally generate SubT worlds utilizing topometric tiles from the DARPA World Generator. Three robotics case studies are investigated to demonstrate the utility of SubTGraph for rigorous validation of different layers in the robotic autonomy stack. Structural semantic segmentation is validated against topometric ground truths, multi‑agent path planning is tested extensively to identify patterns and trends in algorithm behavior, and LIO SLAM is stress‑tested in challenging subterranean sections to identify failure cases. The SubTGraph world creation codebase is open‑sourced (https://github.com/LTU‑RAI/SubTGraph.git) along with a database consisting of 150 highly variable underground worlds.

Abstract:
Multimodal semantic segmentation benefits remote sensing analysis by combining complementary information from different sensor modalities. In real‑world remote sensing applications, one or more modalities may be unavailable due to sensor failures, adverse atmospheric conditions, or data acquisition problems. Even with pretrained multimodal representations and existing fine‑tuning or adaptation strategies, performance may remain limited because all modality availability scenarios are typically treated as equally informative during training. In this paper, we propose a novel training strategy that learns a scenario sampling distribution directly from the pretrained latent space. Instead of relying on uniform random modality dropout, the proposed method guides fine‑tuning toward more informative modality availability scenarios. More specifically, we quantify the effect of each scenario independently based on the distortion it induces in the shared latent representation. We then capture scenario relations using a radial basis function kernel and derive refined scenario scores through a regularized kernel smoothing. These scores are then converted into a probability distribution during scenario sampling for fine‑tuning. We evaluate this strategy on three remote sensing image sets, namely DSTL, Potsdam, and Hunan, using CBC‑SLP, CBC, and CMX backbones. The experimental results with different image sets and backbones show that our method outperforms standard fine‑tuning and LoRA‑based adaptation. These findings suggest that the pretrained latent representation can serve as an effective basis for sampling during missing modality fine‑tuning. Code is available at https://github.com/iremulku/Latent‑Space‑Guided‑Scenario‑Sampling
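The pipeline from per-scenario distortion to a sampling distribution can be sketched as below. The exact smoothing rule is not given in the abstract, so this uses a Nadaraya-Watson style average with an extra weight `lam` on each scenario's own raw score as one possible form of regularized kernel smoothing; `gamma`, `lam`, the binary availability masks, and the softmax conversion are all assumptions.

```python
import math


def rbf_kernel(a, b, gamma=1.0):
    """RBF similarity between two modality-availability masks."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * sq_dist)


def smooth_scores(masks, distortions, gamma=1.0, lam=1.0):
    """Kernel-weighted average of raw scores, pulled toward each raw score."""
    refined = []
    for i, mask_i in enumerate(masks):
        weights = [rbf_kernel(mask_i, mask_j, gamma) for mask_j in masks]
        numerator = lam * distortions[i] + sum(
            w * d for w, d in zip(weights, distortions))
        refined.append(numerator / (lam + sum(weights)))
    return refined


def to_distribution(scores):
    """Softmax, so that larger refined scores are sampled more often."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Three availability scenarios over two modalities (1 = modality present).
masks = [(1, 1), (1, 0), (0, 1)]
distortions = [0.0, 0.0, 3.0]  # illustrative latent-space distortion values
probs = to_distribution(smooth_scores(masks, distortions))
```

During fine-tuning, a scenario index could then be drawn with `random.choices(range(len(masks)), weights=probs)` in place of uniform modality dropout.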

Abstract:
Monocular SLAM historically suffers from scale ambiguity and tracking failure in dynamic environments. While recent vision foundation models (VFMs) provide remarkable zero‑shot depth priors, naively integrating these deterministic predictions ignores predictive uncertainty and frame‑to‑frame scale inconsistencies. We propose PRISM‑SLAM, a real‑time framework that rigorously integrates VFM priors into a structured Bayesian factor graph to achieve scale‑aware, metric‑consistent localization and mapping. Specifically, we introduce a Plücker Ray‑Distance Factor to anchor monocular observations in absolute space within a globally consistent metric coordinate system, mathematically resolving scale drift by making the metric scale Fisher‑identifiable. To handle environmental dynamics, we derive an epistemic uncertainty proxy from temporal depth consistency and formulate a Dynamic Scene Uncertainty Gating (DSUG) mechanism. This soft‑gating approach probabilistically down‑weights dynamic distractors without incurring the heavy computational overhead associated with traditional semantic segmentation masks. By employing a multi‑process architecture that asynchronously processes VFM inference and geometric tracking, PRISM‑SLAM provides verified metric output at 30 FPS using solely RGB input, bridging the gap between foundation models and real‑world robotic applications. Evaluated on the TUM RGB‑D and 7‑Scenes benchmarks, PRISM‑SLAM achieves a metric SE(3) Absolute Trajectory Error (ATE) nearly identical to its oracle‑aligned Sim(3) error. This demonstrates that our system can produce deployment‑ready metric trajectories by delivering robust metric SLAM solutions without any post‑hoc scale correction. Project page: https://prismslam‑cmd.github.io/prismslam_pr/

Abstract:
Open‑vocabulary segmentation models such as SAM3 perform well across broad categories via text prompting, yet degrade when target classes are visually underrepresented in pretraining or depart from canonical depictions, limitations that text prompts cannot resolve spatially. We present SegRAG, a training‑free retrieval‑augmented segmentation framework that grounds SAM3 with class‑specific point prompts derived from a curated DINOv3 feature bank. Offline, dense patch‑level descriptors are extracted from annotated references and filtered by Intra‑Class Cohesion Distillation (ICCD), retaining only prototypes that reliably retrieve within‑class foreground. At inference, Topographic Similarity Grounding (TSG) computes a cosine‑similarity landscape against retrieved prototypes, identifies coherent high‑confidence regions via connected‑component analysis, and extracts peak locations through non‑maximum suppression. The resulting point prompts are delivered jointly with class‑name text in a single SAM3 forward pass. On four standard benchmarks, SegRAG consistently outperforms the text‑only baseline, gaining up to +3.92 mIoU on LVIS. On AgML agricultural benchmarks under zero‑shot domain transfer, it raises mean IoU from 25.27 to 59.24 (+33.97) and recovers individual classes from zero to over 95 mIoU. Ablations confirm that ICCD, TSG, and joint prompting each contribute independently and compound when combined. Code is available at https://github.com/boudiafA/SegRAG.
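
The region-then-peak step of TSG can be sketched on a toy similarity map. The sketch assumes the cosine-similarity map is already computed, uses 4-connected components, and keeps a single peak per component as a simplified stand-in for non-maximum suppression. The function name `peak_points`, the threshold, and the minimum area are hypothetical and not taken from SegRAG.

```python
from collections import deque

def peak_points(sim, threshold=0.5, min_area=2):
    """Return one (row, col) peak per sufficiently large high-similarity region."""
    rows, cols = len(sim), len(sim[0])
    seen = [[False] * cols for _ in range(rows)]
    points = []
    for r0 in range(rows):
        for c0 in range(cols):
            if seen[r0][c0] or sim[r0][c0] < threshold:
                continue
            # Breadth-first search collects one connected component.
            seen[r0][c0] = True
            queue, component = deque([(r0, c0)]), []
            while queue:
                r, c = queue.popleft()
                component.append((r, c))
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nr, nc = r + dr, c + dc
                    if (0 <= nr < rows and 0 <= nc < cols
                            and not seen[nr][nc] and sim[nr][nc] >= threshold):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            # Discard tiny regions, then keep the most similar cell as the prompt.
            if len(component) >= min_area:
                points.append(max(component, key=lambda rc: sim[rc[0]][rc[1]]))
    return points

sim = [
    [0.1, 0.2, 0.1, 0.0, 0.0],
    [0.2, 0.7, 0.9, 0.1, 0.6],
    [0.1, 0.6, 0.8, 0.1, 0.0],
    [0.0, 0.1, 0.1, 0.1, 0.0],
]
points = peak_points(sim)
```

The isolated cell at row 1, column 4 passes the threshold but is dropped by the area filter, so only the peak of the 2x2 region is returned.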

Abstract:
We investigate the potential of invariant and equivariant semi‑supervised learning for addressing the challenges of training multi‑task models on partially labeled datasets with differently structured output tasks. Specifically, we use the popular FixMatch method for invariant semi‑supervised learning and its equivariant extension Dense FixMatch. We evaluate their performance on the Cityscapes and BDD100K datasets in the context of the prevalent object detection and semantic segmentation tasks in computer vision. We consider varying sizes of the subsets annotated for each task and different overlaps among them. Our results for both invariant and equivariant semi‑supervised learning outperform supervised baselines in most situations, with the most significant improvements observed when fewer labeled samples are available for a task and generally better results for the latter approach. Our study suggests that invariant/equivariant learning is a promising general direction for multi‑task learning from limited labeled data.
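
For readers unfamiliar with FixMatch, its unlabeled-data term can be sketched in a few lines: a confident prediction on a weakly augmented view becomes a pseudo-label for the strongly augmented view. The sketch assumes class probabilities are already available as plain lists; the function name `fixmatch_loss` and the example values are illustrative. The dense variant additionally has to apply the same geometric transformation to the pseudo-labels, which is not shown here.

```python
import math

def fixmatch_loss(weak_probs, strong_probs, tau=0.95):
    """Mean cross-entropy on strong views, counting only confident pseudo-labels."""
    total = 0.0
    for pw, ps in zip(weak_probs, strong_probs):
        confidence = max(pw)
        if confidence >= tau:  # low-confidence samples contribute zero loss
            label = pw.index(confidence)
            total += -math.log(ps[label])
    # As in FixMatch, the sum is divided by the full batch size.
    return total / len(weak_probs)

weak = [[0.97, 0.03], [0.60, 0.40]]    # predictions on weakly augmented views
strong = [[0.50, 0.50], [0.90, 0.10]]  # predictions on strongly augmented views
loss = fixmatch_loss(weak, strong)
```

Only the first sample exceeds the confidence threshold, so the second sample is masked out of the loss.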

Abstract:
Referring segmentation aims to segment the target objects in images or videos based on the textual query. Despite remarkable progress over the past years, existing works always assume that the user‑provided queries are already precise and clear. However, this assumption is impractical. In real‑world scenarios, it is unrealistic to expect all users to thoroughly review their visual content and carefully ensure their queries are unique and unambiguous. When encountering such cases, existing segmentation models tend to arbitrarily guess the user preferences, often resulting in undesired outcomes. To address this limitation, we propose IC‑Seg, a novel agentic framework that proactively clarifies user intent through multi‑turn conversation before segmentation. To effectively incentivize this capability, we further introduce Hi‑GRPO, a new hierarchical optimization strategy that injects dense and informative supervision signals at the trajectory, turn, and step levels. This strategy encourages efficient intent clarification, effectively eliminating redundant interactions and improving overall dialogue quality. For evaluation, we establish Ambi‑RVOS, a referring video object segmentation benchmark with ambiguous user queries. Extensive experiments demonstrate that IC‑Seg not only outperforms existing methods by a large margin in resolving ambiguous queries, but also maintains state‑of‑the‑art performance on standard reasoning segmentation benchmarks. Code and data will be released at https://github.com/iSEE‑Laboratory/IC‑Seg.

Abstract:
Pre‑trained vision foundation models (VFMs) provide strong semantic representations, yet their patch‑level features are inherently coarse, limiting their effectiveness on tasks requiring fine‑grained localization, dense prediction, and point‑wise correspondence. In this work, we revisit feature upsampling for VFMs from the perspective of inverse problems and propose Weighted Reverse Convolution (WRC), a spatially adaptive inverse operator for densifying high‑level visual descriptors. Specifically, we formulate feature upsampling as a weighted Tikhonov‑regularized least‑squares problem, where spatially varying weights modulate both data fidelity and prior strength at each spatial location. This allows WRC to adapt the reconstruction to spatially varying feature characteristics, thereby preserving critical structures while mitigating over‑smoothing. Moreover, WRC retains an efficient, fully differentiable closed‑form FFT solution, making it a practical drop‑in upsampling operator. Integrated into a lightweight self‑supervised densification framework, WRC consistently improves dense feature quality across various downstream benchmarks, including segmentation, depth estimation, video object segmentation, object discovery, and keypoint correspondence, while maintaining high computational efficiency.
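
For orientation, the classical unweighted special case of such a problem can be written out. The sketch below assumes circular convolution with a known kernel $k$, a scalar regularization weight $\lambda$, and a prior estimate $x_0$. These symbols are illustrative and not taken from the abstract, and neither the downsampling operator nor the spatially varying weights of WRC are reproduced here.

```latex
\hat{x} \;=\; \arg\min_{x} \; \| k \circledast x - y \|_2^2 + \lambda \, \| x - x_0 \|_2^2
\quad\Longrightarrow\quad
\hat{x} \;=\; \mathcal{F}^{-1}\!\left(
  \frac{\overline{\mathcal{F}(k)} \odot \mathcal{F}(y) + \lambda \, \mathcal{F}(x_0)}
       {|\mathcal{F}(k)|^{2} + \lambda}
\right)
```

The closed form follows because circular convolution is diagonalized by the Fourier transform, so the normal equations $(K^{H}K + \lambda I)\,x = K^{H}y + \lambda x_0$ decouple into one scalar equation per frequency.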

Abstract:
While hyperspectral imaging provides rich spatial‑spectral information across hundreds of narrow wavelength bands for precise material identification, ground‑based hyperspectral pre‑trained backbones remain absent, constrained by varying spectral configurations across sensors, the scarcity and inconsistency of labels, and the limited scale and scene diversity of existing datasets. To address these challenges and enable universal perception, we propose HyperVision, the first ground‑based hyperspectral pre‑trained backbone. First, to handle varying spectral configurations, HyperVision adopts a channel‑adaptive dynamic embedding mechanism to map heterogeneous inputs into a unified token space. Second, we develop an unsupervised representation learning framework. Specifically, to address label scarcity and inconsistency, a multi‑source pseudo‑labeling method is introduced to fuse spatial structures from SAM2 and fine‑grained spectral material information from HyperFree. Furthermore, to enrich scene diversity and compensate for limited dataset scale, a cross‑modal knowledge distillation mechanism is utilized to transfer rich semantic representations from a pre‑trained RGB vision model to our backbone. Pre‑trained on a collection of 15k images from 26 diverse ground‑based datasets, HyperVision demonstrates exceptional generalization. Requiring only efficient head‑only adaptation without adjusting backbone parameters, it achieves state‑of‑the‑art performance compared to task‑specific methods across three downstream tasks under varying sensor configurations, yielding up to a 16.3% relative improvement in hyperspectral semantic segmentation $\mathrm{Acc}_\mathrm{M}$, a 2.1% relative gain in object tracking AUC, and a 35.5% reduction in salient object detection MAE. The source code and pre‑trained model will be publicly available at https://github.com/lronkitty/HyperVision .

Abstract:
Point clouds are the most widely adopted format for representing 3D shapes and scenes due to their simplicity and geometric fidelity. However, their inherently unordered and irregular nature, exacerbated by sensor noise and occlusions, introduces unique challenges for machine‑learning‑based methods. To combat these issues, diverse strategies have been developed, including conversion to an ordered format, local geometry extraction, and permutation‑invariant or self‑attention‑based processing. In this paper, we focus on deep learning models for three fundamental tasks in 3D vision: point cloud classification, part segmentation, and semantic segmentation. We begin by formally defining point cloud data, followed by an in‑depth discussion of its structural characteristics. Then, we categorize notable works based on their backbone structure and evaluate their performance on popular benchmarks. Beyond empirical comparison, we offer insights into architectural innovations and limitations. We also outline open challenges and promising future directions for 3D point cloud understanding.

Abstract:
Autonomous AI coding agents are becoming a core tool for ML practitioners in industry and research alike. Despite this growing adoption, no standardized benchmark exists to evaluate their ability to design, implement, and train models from scratch across diverse domains. We introduce 1GC‑7RC (Single Graphic Card: Seven Research Challenges), a benchmark comprising seven ML tasks spanning language modeling, image classification, semantic segmentation, graph learning, tabular prediction, time‑series forecasting, and text classification. Each task provides a locked data‑preparation and evaluation script together with a baseline training script; the agent may only modify the training code, has no access to pretrained weights (with one controlled exception for semantic segmentation), no internet access, and must complete each task within a task‑specific wall‑clock budget (40‑120 minutes) on a single GPU. We evaluate seven coding agents: five proprietary (Claude Code with Sonnet 4.6, Opus 4.6, and Opus 4.7; Codex CLI with GPT 5.5; and OpenCode with Qwen 3.6+) and two open‑source (OpenCode with Kimi K2.5, Kimi K2.6). Across 5 runs per agent‑task pair, we report substantial performance differences that reveal varying levels of implicit ML knowledge, planning ability, and time‑budget management. The benchmark, harness, and all evaluation artifacts are publicly available on GitHub at https://github.com/Strolchii/1GC‑7RC‑Benchmark to facilitate reproducible comparison of future agents. Because our benchmark design is modular, the benchmark can be extended to new tasks and domains, adapted to different GPU budgets, and used to study multi‑agent settings, making it a flexible platform for future research on autonomous research agents.

Abstract:
Semantic segmentation of multi‑source remote sensing images is a fundamental task for Earth observation applications. Existing methods often struggle with insufficient multi‑scale context modeling and suboptimal cross‑modal feature fusion, limiting their performance in complex high‑resolution scenes. To this end, we propose Axial‑Relation Guided Fusion Mamba (ARG‑Mamba), a state space model‑based framework for optical‑elevation remote sensing image segmentation. Specifically, we introduce a Multi‑Scale State Space Module to capture both fine‑grained local details and global contextual dependencies with linear computational complexity. Moreover, an Axial‑Relation Guided Fusion Module is designed to explicitly model global cross‑modal correlations along horizontal and vertical axes, enabling efficient feature fusion between optical and elevation modalities. Extensive experiments conducted on the ISPRS Vaihingen and Potsdam datasets demonstrate that our ARG‑Mamba consistently outperforms state‑of‑the‑art methods while maintaining favorable computational efficiency. The code will be made publicly available at https://github.com/oucailab/ARG‑Mamba.

Abstract:
Scene graphs are becoming a standard representation for robot navigation, providing hierarchical geometric and semantic scene understanding. However, most scene graph mapping methods rely on depth cameras or LiDAR sensors. In this work, we present LEXI‑SG, the first dense monocular visual mapping system for open‑vocabulary 3D scene graphs using only RGB camera input. Our approach exploits the semantic priors of open‑vocabulary foundation models to partition the scene into rooms, deferring feed‑forward reconstruction until each room is fully observed, which enables scalable dense mapping without sliding‑window scale inconsistencies. We propose a room‑based factor graph formulation to globally align room reconstructions while preserving local map consistency and naturally imposing the semantic scene graph hierarchy. Within each room, we further support open‑vocabulary object segmentation and tracking. We validate LEXI‑SG on indoor scenes from the Habitat‑Matterport 3D dataset and on self‑collected egocentric office sequences. We evaluate its performance against existing feed‑forward SLAM methods, as well as established scene graph baselines. We demonstrate improved trajectory estimation and dense reconstruction, as well as competitive performance in open‑vocabulary segmentation. LEXI‑SG shows that accurate, scalable, open‑vocabulary 3D scene graphs can be achieved from monocular RGB alone. Our project page and office sequences are available here: https://ori‑drs.github.io/lexisg‑web/.

Abstract:
We introduce EvObj, a method for unsupervised 3D instance segmentation that bridges the geometric domain gap between synthetic pretraining data and real‑world point clouds. Current methods suffer from structural discrepancies when transferring object priors from synthetic datasets (e.g., ShapeNet) to real scans (e.g., ScanNet), particularly due to morphological variations and occlusion artifacts. To address this, EvObj integrates two modules: (1) an object discerning module that dynamically refines object candidates, enabling continuous adaptation of object priors to target domains; and (2) an object completion module that reconstructs partial geometries after discovering objects. Extensive experiments on both real‑world and synthetic datasets demonstrate that EvObj outperforms all baselines in 3D object segmentation, achieving state‑of‑the‑art results.

Abstract:
Pursuing training‑free open‑vocabulary semantic segmentation in an efficient and generalizable manner remains challenging due to the deep‑seated spatial bias in CLIP. To overcome the limitations of existing solutions, this work moves beyond the CLIP‑based paradigm and harnesses the recent spatially‑aware dino.txt framework to facilitate more efficient and high‑quality dense prediction. While dino.txt exhibits robust spatial awareness, we find that the semantic ambiguity of text queries gives rise to severe mismatch within its dense cross‑modal interactions. To address this, we introduce Visual‑guided Prompt evolution (VIP) to rectify the semantic expressiveness of text queries in dino.txt, unleashing its potential for fine‑grained object perception. To this end, VIP integrates alias expansion with a visual‑guided distillation mechanism to mine valuable semantic cues, which are robustly aggregated in a saliency‑aware manner to yield a high‑fidelity prediction. Extensive evaluations demonstrate that VIP (1) surpasses leading methods by 1.4%‑8.4% average mIoU, (2) generalizes well to diverse challenging domains, and (3) adds only marginal inference time and memory overhead.

Abstract:
The performance of promptable video object segmentation (PVOS) models substantially degrades under input corruptions, which prevents PVOS deployment in safety‑critical domains. This paper offers the first comprehensive study on robust PVOS (RobustPVOS). We first construct a new, comprehensive benchmark with two real‑world evaluation datasets of 351 video clips and more than 2,500 object masks under real‑world adverse conditions. At the same time, we generate synthetic training data by applying diverse and temporally varying corruptions to existing VOS datasets. Moreover, we present a new RobustPVOS method, dubbed Memory‑object‑conditioned Gated‑rank Adaptation (MoGA). The key to successfully performing RobustPVOS is two‑fold: effectively handling object‑specific degradation and ensuring temporal consistency in predictions. MoGA leverages object‑specific representations maintained in memory across frames to condition the robustification process, which allows the model to handle each tracked object differently in a temporally consistent way. Extensive experiments on our benchmark validate MoGA's efficacy, showing consistent and significant improvements across diverse corruption types on both synthetic and real‑world datasets, establishing a strong baseline for future RobustPVOS research. Our benchmark is publicly available at https://sohyun‑l.github.io/RobustPVOS_project_page/.

Abstract:
Recent conditional image generation methods can improve controllability by generating images that are faithful to conditions such as sketches, human poses, segmentation maps, and depth. By applying these techniques to image augmentation while preserving annotations, generated images can be used as additional training data and can improve recognition performance. However, for high‑level driving tasks such as traffic‑rule extraction and driving‑behavior understanding, simply using annotations as conditions is insufficient. Instead, images must be augmented while preserving the detailed high‑level structure of the original scene. One possible solution is to use multiple conditions so that generated images retain diverse structural cues after generation. However, when multiple conditions are used, conflicts among conditions can prevent reliable structure preservation. In this work, we input semantic segmentation, depth, and edges extracted from the original image into a multi‑condition image generation model, thereby providing rich structural information as conditions. We further propose a modeling approach for handling conflicts among multiple conditions and show that it enables image generation with stronger structural preservation. We also build a generation framework and evaluation protocol for driving tasks, establishing a basis for comparison with prior and future models. As a result, this work contributes to image generation research by addressing condition conflicts in multi‑condition generation and provides an important step toward mitigating data scarcity in high‑level autonomous‑driving tasks.

Abstract:
Open‑vocabulary semantic segmentation requires adapting image‑level vision‑language models such as CLIP to dense pixel‑level prediction, which is challenging due to the mismatch between hierarchical structure and semantic alignment in the embedding space. While recent works leverage hyperbolic geometry to model hierarchical relationships, they align embeddings across hierarchical levels but overlook semantic misalignment among embeddings within the same level. In this work, we propose HyRo, a hyperbolic fine‑tuning framework that decouples hierarchical and semantic alignment in the Poincaré ball model. HyRo aligns hierarchical levels by adjusting the hyperbolic radius and refines semantic relationships through angular alignment using an orthogonal transformation that theoretically preserves the hyperbolic radius. Experiments on standard open‑vocabulary semantic segmentation benchmarks demonstrate that HyRo achieves state‑of‑the‑art performance over prior methods.
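
The claim that an orthogonal transformation preserves the hyperbolic radius can be checked numerically. The sketch assumes the Poincaré ball with curvature -1, where the distance from the origin to a point x is 2·artanh(‖x‖), and uses a 2D rotation as the orthogonal map. The function names and the example point are illustrative and not part of HyRo.

```python
import math

def hyperbolic_radius(x):
    """Distance from the origin in the Poincare ball (curvature -1)."""
    norm = math.sqrt(sum(v * v for v in x))
    return 2.0 * math.atanh(norm)

def rotate(x, theta):
    """Apply a 2D rotation, the simplest orthogonal transformation."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * x[0] - s * x[1], s * x[0] + c * x[1])

x = (0.3, 0.4)      # Euclidean norm 0.5, so the point lies inside the unit ball
y = rotate(x, 1.1)  # changes the direction (angle) of the embedding only

radius_before = hyperbolic_radius(x)
radius_after = hyperbolic_radius(y)
```

Because the radius depends on x only through its Euclidean norm, and orthogonal maps preserve that norm, the angular adjustment leaves the hierarchical level encoded by the radius untouched.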

Abstract:
Semantic segmentation of large‑scale 3D point clouds is crucial for applications such as autonomous driving and urban digital twins. However, the sparse sampling pattern of LiDAR and the view‑dependent geometric distortion in image observations complicate cross‑modal alignment and hinder stable fusion. Inspired by the fact that 2D images captured by cameras are representations of the 3D world, we recognize that the features learned from 2D and 3D segmentation share some common semantics, while other aspects remain modality‑specific. This insight motivates a unified multimodal framework for joint 2D‑3D semantic segmentation. We combine a SAM‑based vision encoder with a SPTNet‑based geometric encoder to extract complementary semantic and geometric representations. The resulting features from both modalities are explicitly decomposed into shared and private subspaces, where the shared components summarize semantic factors common to both domains, and the private components preserve properties that are unique to each modality. A lightweight attention‑based fusion module aggregates the shared features into a consistent cross‑modal representation, and a regularized training objective ensures both semantic alignment and subspace independence. Experiments on the SemanticKITTI and nuScenes benchmarks demonstrate consistent improvements in segmentation accuracy over representative multimodal baselines, accompanied by competitive computational efficiency. Cross‑domain evaluation on nuScenes USA‑Singapore shows stable performance under distribution shifts, demonstrating strong generalization. The implementation code is publicly available at: https://github.com/shuaizhang69/UniD‑Shift.

Abstract:
Sustainable forest management relies on precise species composition mapping, yet traditional ground surveys are labour‑intensive and geographically constrained. While Uncrewed Aerial Vehicles (UAVs) offer scalable data collection, the transition to deep learning‑based interpretation is bottlenecked by the severe scarcity of expert‑annotated imagery, particularly in complex, visually heterogeneous regeneration zones. This paper addresses the dual challenges of data scarcity and extreme class imbalance in the semantic segmentation of fine‑grained forest regeneration species by providing a scalable framework that reduces reliance on manual photo‑interpretation for high‑resolution, millimetre‑level aerial imagery. Importantly, we leverage the large‑scale vision‑language Nano Banana Pro model to simultaneously generate high‑fidelity images and their corresponding pixel‑aligned semantic masks from prompts. We introduce WilDReF‑Q‑V2, an expansion of a natural forest dataset with 13 977 new unlabelled and 50 labelled real images, as well as the Gen4Regen dataset, featuring 2101 pairs of synthetic images and semantic masks. Our methodology integrates real‑world data with AI‑generated images, highlighting that AI‑generated data is highly complementary to real‑world data, with unified training yielding an F1 score improvement of over 15 %pt compared to purely supervised baselines. Furthermore, we demonstrate that even small quantities of prompt‑generated data significantly improve performance for underrepresented species, some of which saw per‑species F1 score gains of up to 30 %pt. We conclude that vision‑language models can serve as agile data generators, effectively bootstrapping perception tasks for niche AI domains where expert labels are scarce or unavailable. Our datasets, source code, and models will be available at https://norlab‑ulaval.github.io/gen4regen.

Abstract:
Accurate image segmentation is essential for modern computer vision applications such as image editing, autonomous driving, and medical image analysis. In recent years, Dichotomous Image Segmentation (DIS) has become a standard task for training and evaluating highly accurate segmentation models. Existing DIS approaches often fail to preserve fine‑grained details or fully capture the semantic structure of the foreground. To address these challenges, we present FlowDIS, a novel dichotomous image segmentation method built on the flow matching framework, which learns a time‑dependent vector field to transport the image distribution to the corresponding mask distribution, optionally conditioned on a text prompt. Moreover, with our Position‑Aware Instance Pairing (PAIP) training strategy, FlowDIS offers strong controllability through text prompts, enabling precise, pixel‑level object segmentation. Extensive experiments demonstrate that our method significantly outperforms state‑of‑the‑art approaches both with and without language guidance. Compared with the best prior DIS method, FlowDIS achieves a 5.5% higher $F_\beta^\omega$ measure and 43% lower MAE ($\mathcal{M}$) on the DIS‑TE test set. The code is available at: https://github.com/Picsart‑AI‑Research/FlowDIS

Abstract:
Recovering 3D human pose and shape from a single image remains a cornerstone of human‑centric vision, yet most methods assume adult subjects and optimize each person independently. These assumptions fail in real‑world, all‑age scenes, where body proportions and depth must be resolved jointly. We introduce Anny‑Fit, a multi‑person, camera‑space optimization framework for all‑age 3D human mesh recovery (HMR). Unlike existing per‑person fitting methods, Anny‑Fit jointly optimizes all individuals directly in the camera coordinate system, enforcing global spatial consistency. At the core of our approach is the use of multiple forms of expert knowledge, including metric depth maps, instance segmentation, 2D keypoints, and VLM‑derived semantic attributes such as age and gender, each obtained from dedicated off‑the‑shelf networks. These complementary signals jointly guide the optimization, constraining the depth‑scale ambiguity characteristic of all‑age scenes. Across diverse datasets, Anny‑Fit consistently improves 2D reprojection accuracy (+13 to 16), relative depth ordering (+6 to 7), 3D estimation error (‑9 to ‑29) and shape estimation (+25 to +82), producing more coherent scenes. Finally, we show that VLM‑based semantic knowledge can be distilled into an HMR model via the pseudo‑ground‑truth annotations produced by Anny‑Fit on training data, enabling it to learn semantically meaningful shape parameters while improving HMR performance. Our approach bridges adult‑only and all‑age modeling by enabling zero‑shot adaptation of adult‑trained HMR pipelines to the full age spectrum without retraining. Code is publicly available at https://github.com/naver/anny‑fit.

Abstract:
Weakly Supervised Semantic Segmentation (WSSS) with image‑level labels typically leverages Class Activation Maps (CAMs) to achieve pixel‑level predictions. Recently, Contrastive Language‑Image Pre‑training (CLIP) has been introduced to generate CAMs in WSSS. However, previous WSSS methods solely adopt CLIP's vision‑language paired property for dense localization, neglecting its inherently limited dense knowledge across both visual and text modalities, which renders CAM generation suboptimal. In this work, we propose DiCLIP, a novel WSSS framework that leverages the generative diffusion model to enhance CLIP's dense knowledge across two modalities. Specifically, Visual Correlation Enhancement (VCE) and Text Semantic Augmentation (TSA) modules are proposed for dense prediction enhancement. To improve the spatial awareness of visual features, our VCE module utilizes diffusion's reliable spatial consistency to mitigate the over‑smoothing issue in CLIP's attention. It designs the Attention Clustering Refinement (ACR) module to reliably extract diverse correlation maps from the diffusion model. The correlation maps act as a diversity bias for CLIP's self‑attention, recursively pushing its visual features towards a more discriminative dense distribution. To augment the semantics of text embeddings, our TSA module argues that a single text modality is insufficient to encompass the variability of visual categories. Thus, we leverage diffusion's generative power to maintain a dynamic key‑value cache model, shifting CAM generation from a patch‑text matching mechanism to a novel visual knowledge retrieval paradigm. With these enhancements, DiCLIP not only outperforms state‑of‑the‑art methods on PASCAL VOC and MS COCO but also significantly reduces training costs. Code is publicly available at https://github.com/zwyang6/DiCLIP.

Abstract:
We introduce Ilov3Splat, a novel framework for instance‑level open‑vocabulary 3D scene understanding built on 3D Gaussian Splatting (3D‑GS). Most prior work depends on 2D rendering‑based matching or point‑level semantic association, which undermines cross‑view consistency, lacks coherent instance‑level reasoning, and limits precision in downstream 3D tasks. To address these limitations, our method jointly optimizes scene geometry and semantic representations by augmenting Gaussian splats with view‑consistent feature fields. Specifically, we leverage multi‑resolution hash embedding to efficiently encode language‑aligned CLIP features, enabling dense and coherent language grounding in 3D space. We further train an instance feature field using contrastive loss over SAM masks, supporting fine‑grained object distinction across views. At inference time, CLIP‑encoded queries are matched against the learned features, followed by two‑stage 3D clustering to retrieve relevant Gaussian groups. This enables our framework to identify arbitrary objects in 3D scenes based on natural language descriptions, without requiring category supervision or manual annotations. Experiments on standard benchmarks demonstrate that Ilov3Splat outperforms prior open‑vocabulary 3D‑GS methods in both object selection and instance segmentation, offering a flexible and accurate solution for language‑driven 3D scene understanding. Project page: https://csiro‑robotics.github.io/Ilov3Splat.

Abstract:
Rising global food demand and growing climate pressure increase the need for sustainable, precise agricultural practices. Automated, individualized plant treatment relies on fine‑grained visual analysis, yet leaf‑level segmentation remains underexplored despite its value for assessing crop health, growth dynamics, yield potential and localized stress symptoms. Progress is limited by a lack of dedicated datasets, especially regarding species coverage, and by the absence of systematic evaluations of modern instance‑segmentation architectures for this task. We address these gaps by surveying current data and identifying four suitable, publicly available leaf‑segmentation datasets. Using them, we compare one‑stage, two‑stage and Transformer‑based detectors and identify a YOLO26 model configuration to provide the best trade‑off for real‑world precision‑agriculture tasks. Extensive cross‑domain generalization experiments reveal substantial performance drops across plant species and recording setups, especially for models trained solely on laboratory data. To strengthen data availability, we introduce a new benchmark dataset with leaf‑level masks for 23 plant species, created via semi‑automatic annotation of selected CropAndWeed images. A model trained on all four existing datasets achieves a mean mAP50‑95 of 83.9% across their corresponding test sets and 40.2% on our new benchmark, demonstrating improved generalization and highlighting the need for diverse leaf‑segmentation datasets in robust precision agriculture.

Abstract:
Open‑vocabulary semantic mapping enables robots to spatially ground previously unseen concepts without requiring predefined class sets. Current training‑free methods commonly rely on multi‑view fusion of semantic embeddings into a 3D map, either at the instance‑level via segmenting views and encoding image crops of segments, or by projecting image patch embeddings directly into a dense semantic map. The latter approach sidesteps segmentation and 2D‑to‑3D instance association by operating on full uncropped image frames, but existing methods remain limited in scalability. We present FUS3DMaps, an online dual‑layer semantic mapping method that jointly maintains both dense and instance‑level open‑vocabulary layers within a shared voxel map. This design enables further voxel‑level semantic fusion of the layer embeddings, combining the complementary strengths of both semantic mapping approaches. We find that our proposed semantic cross‑layer fusion approach improves the quality of both the instance‑level and dense layers, while also enabling a scalable and highly accurate instance‑level map where the dense layer and cross‑layer fusion are restricted to a spatial sliding window. Experiments on established 3D semantic segmentation benchmarks as well as a selection of large‑scale scenes show that FUS3DMaps achieves accurate open‑vocabulary semantic mapping at multi‑story building scales. Additional material and code will be made available: https://githanonymous.github.io/FUS3DMaps/.

Abstract:
Attacking semantic segmentation models is significantly harder than attacking image classification models because an attacker must flip thousands of pixel predictions simultaneously. Standard pixel‑wise cross‑entropy (CE) is ill‑suited to this setting: it tends to overemphasize already‑misclassified pixels, which slows optimization and overstates model robustness. To address these issues, we introduce TsallisPGD, an adversarial attack built on the Tsallis cross‑entropy, a generalization of CE parameterized by q, which adaptively reshapes the gradient landscape by controlling gradient concentration across pixels. By varying q, we steer the attack toward pixels at different confidence levels. We first show that no single fixed q is universally optimal, as its effectiveness depends on the dataset, model architecture, and perturbation budget. Motivated by this, we propose a dynamic q‑schedule that sweeps q during optimization. Extensive experiments on Cityscapes, Pascal VOC, and ADE20K show that TsallisPGD, using a single validation‑selected schedule, achieves the best average attack rank across all evaluated settings and improves over CEPGD, SegPGD, CosPGD, JSPGD, and MaskedPGD in reducing accuracy and mIoU on both standard and robust models.
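To make the role of q concrete, here is a small sketch. It assumes the per-pixel loss has the form -ln_q(p), where ln_q(x) = (x^(1-q) - 1) / (1 - q) is the standard Tsallis q-logarithm and p is the predicted probability of the true class; the abstract does not give the exact loss, so this form, the function names, and the linear schedule are illustrative assumptions rather than the authors' implementation.

```python
import math

def q_log(x: float, q: float) -> float:
    """Tsallis q-logarithm; equals the natural log when q == 1."""
    if abs(q - 1.0) < 1e-12:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def tsallis_ce(p_true: float, q: float) -> float:
    """Assumed per-pixel loss: negative q-logarithm of the true-class probability."""
    return -q_log(p_true, q)

def grad_magnitude(p_true: float, q: float) -> float:
    """|d loss / d p_true| for the assumed loss, which works out to p_true ** (-q)."""
    return p_true ** (-q)

def linear_q_schedule(step: int, total_steps: int, q_start: float, q_end: float) -> float:
    """Hypothetical schedule that sweeps q linearly over the attack iterations."""
    if total_steps <= 1:
        return q_start
    t = step / (total_steps - 1)
    return q_start + t * (q_end - q_start)

if __name__ == "__main__":
    # Compare how strongly a low-probability pixel dominates a high-probability one.
    for q in (0.0, 0.5, 1.0, 1.5):
        ratio = grad_magnitude(0.05, q) / grad_magnitude(0.9, q)
        print(f"q={q}: gradient ratio (p=0.05 vs p=0.9) = {ratio:.2f}")
```

Under this assumed form, the gradient with respect to the true-class probability scales as p^(-q), so larger q concentrates gradient on pixels whose true-class probability is already low, while q = 0 weights all pixels equally. This is one way to read the abstract's remark that q controls gradient concentration across pixels.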

Abstract:
The remote sensing (RS) domain suffers from a lack of densely labeled datasets, which are costly to obtain. Thus, models that can segment RS imagery well without supervised fine‑tuning are valuable, but existing solutions fall behind supervised methods. Recently, DINOv3 surpassed SOTA RS foundation models on the GEO‑bench segmentation benchmark without pre‑training on RS data. Additionally, DINO.txt has enabled open‑vocabulary semantic segmentation (OVSS) with the DINOv3 backbone. We leverage these developments to form an OVSS model for RS imagery, free of RS‑domain fine‑tuning. Our model, CAFe‑DINO (Cost Aggregation + Feature Upsampling with DINO), exploits the strong OVSS performance of DINOv3 for RS imagery via cost aggregation and training‑free upsampling of text‑image similarity scores. The robust latent of the DINOv3 backbone eliminates the need for fine‑tuning on RS imagery; we instead fine‑tune our model on an RS‑targeted subset of COCO‑Stuff. CAFe‑DINO achieves state‑of‑the‑art performance on key RS segmentation datasets, outperforming OVSS methods fine‑tuned on RS data. Our code and data are publicly available at https://github.com/rfaulk/DINO_Soars.

Abstract:
In computational pathology, nuclear instance segmentation is a fundamental task with many downstream clinical applications. With the advent of deep learning, many approaches, including convolutional neural networks (CNNs) and vision transformers (ViTs), have been proposed for this task, along with both machine learning‑based and non‑machine learning‑based pre‑ and post‑processing techniques to further boost performance. However, one fundamental aspect that has received less attention is the evaluation pipeline. In this study, we identify four key issues associated with nuclear instance segmentation evaluation and propose corresponding solutions. Our proposed modifications, namely handling vague regions, score normalization, overlapping instances, and border uncertainty, are integrated into a unified framework called NucEval, which enables robust evaluation of nuclear instance segmentation. We evaluate this pipeline using the NuInsSeg dataset, which provides unique characteristics that make it particularly suitable for this study, as well as two additional external datasets, with three CNN‑ and ViT‑based nuclear instance segmentation models, to demonstrate the impact of these modifications on instance segmentation metrics. The code, along with complete guidelines and illustrative examples, is publicly available at: https://github.com/masih4/nuc_eval.

Abstract:
Vision‑language‑action (VLA) models typically rely on large‑scale real‑world videos, whereas simulated data, despite being inexpensive and highly parallelizable to collect, often suffers from a substantial visual domain gap and limited environmental diversity, resulting in weak real‑world generalization. We present an efficient video augmentation framework that converts simulated VLA videos into realistic training videos while preserving task semantics and action trajectories. Our pipeline extracts structured conditions from simulation via video semantic segmentation and video captioning, rewrites captions to diversify environments, and uses a conditional video transfer model to synthesize realistic videos. To make augmentation practical at scale, we introduce a diffusion feature‑reuse mechanism that reuses video tokens across adjacent timesteps to accelerate generation, and a coreset sampling strategy that identifies a compact, non‑redundant subset for augmentation under limited computation. Extensive experiments on Robotwin 2.0, LIBERO, LIBERO‑Plus, and a real robotic platform demonstrate consistent improvements. For example, our method improves RDT‑1B by 8% on Robotwin 2.0, and boosts π_0 by 5.1% on the more challenging LIBERO‑Plus benchmark. Code is available at: https://github.com/nanfangxiansheng/Seeing‑Realism‑from‑Simulation.

Abstract:
Vision Foundation Models (VFMs) pretrained on large‑scale RGB data have demonstrated remarkable representation quality, yet their applicability to multispectral imaging spanning Near‑Infrared (NIR), Short‑Wave Infrared (SWIR), and Long‑Wave Infrared (LWIR) remains largely unexplored. These spectral modalities offer complementary sensing capabilities critical for robust perception in adverse conditions, but present a fundamental domain gap relative to RGB‑centric pretrained models. We present SpectraDINO, a multispectral VFM that bridges this spectral gap by extending DINOv2 ViT backbones to beyond‑visible modalities through lightweight, per‑modality bottleneck adapters, while preserving the rich representations of the frozen RGB backbone. We introduce a multi‑stage teacher‑student training protocol in which a frozen DINOv2 teacher guides a spectral student via cosine distillation, symmetric contrastive loss, patch‑level alignment, and a novel neighborhood‑structure‑preservation loss. This staged curriculum enables strong cross‑modal alignment without catastrophic forgetting of RGB priors. We evaluate SpectraDINO on multispectral object detection and semantic segmentation across challenging NIR, SWIR, and LWIR benchmarks using widely adopted fusion strategies. SpectraDINO achieves state‑of‑the‑art performance across most benchmarks, validating its effectiveness as a general‑purpose backbone for spectral generalization. The code and weights for model variants are available at https://github.com/Yonsei‑STL/SpectraDINO.

Abstract:
Fine‑grained RGBT image semantic segmentation is crucial for all‑weather unmanned aerial vehicle (UAV) scene understanding. However, UAV RGBT image semantic segmentation faces two coupled challenges: cross‑modal spatial misalignment caused by sensor parallax and platform vibration, and severe semantic confusion among fine‑grained ground objects under top‑down aerial views. To address these issues, we propose a Graph‑based Semantic Calibration Network (GSCNet) for unaligned UAV RGBT image semantic segmentation. Specifically, we design a Feature Decoupling and Alignment Module (FDAM) that decouples each modality into shared structural and private perceptual components and performs deformable alignment in the shared subspace, enabling robust spatial correction with reduced modality appearance interference. Moreover, we propose a Semantic Graph Calibration Module (SGCM) that explicitly encodes the hierarchical taxonomy and co‑occurrence regularities among ground‑object categories in UAV scenes into a structured category graph, and incorporates these priors into graph‑attention reasoning to calibrate predictions of visually similar and rare categories. In addition, we construct the Unaligned RGB‑Thermal Fine‑grained (URTF) benchmark, to the best of our knowledge, the largest and most fine‑grained benchmark for unaligned UAV RGBT image semantic segmentation, containing over 25,000 image pairs across 61 semantic categories with realistic cross‑modal misalignment. Extensive experiments on URTF demonstrate that GSCNet significantly outperforms state‑of‑the‑art methods, with notable gains on fine‑grained categories. The dataset is available at https://github.com/mmic‑lcl/Datasets‑and‑benchmark‑code.

Abstract:
Over the past decade, hyperspectral image (HSI) classification has drawn considerable interest due to HSIs' ability to effectively distinguish terrestrial objects by capturing detailed, continuous spectral information. The strong performance of recent deep learning techniques in tasks like image classification and semantic segmentation has led to their growing use in HSI classification, due to their ability to capture complex spatial and spectral features more effectively than traditional methods. This paper presents MixerCA, a novel lightweight model for HSI classification that leverages depthwise convolution and a self‑attention mechanism. MixerCA integrates depth‑wise convolutions, token and channel mixing, and coordinate attention into a unified structure to decouple spatial and channel interactions, maintain consistent resolution throughout the network, and directly process HSI patches. Extensive experiments on four hyperspectral benchmark datasets reveal MixerCA's clear advantages over several competing algorithms, including 2D‑CNN, 3D‑CNN, Tri‑CNN, HybridSN, ViT, and Swin Transformer. The source code is publicly available at https://github.com/mqalkhatib/MixerCA.

Abstract:
Worldwide image geo‑localization aims to infer the geographic location of an image captured anywhere on Earth, spanning street, city, regional, national, and continental scales. Existing methods rely on visual features that are sensitive to environmental variations (e.g., lighting, season, and weather) and lack effective post‑processing to filter outlier candidates, limiting localization accuracy. To address these limitations, we propose DualGeo, a two‑stage framework for worldwide image geo‑localization. First, it establishes a geo‑representational foundation by fusing image and semantic segmentation features via bidirectional cross‑attention. The fused features are then aligned with GPS coordinates through dual‑view contrastive learning to build a global retrieval database. Second, it performs geo‑cognitive refinement by re‑ranking retrieved candidates using geographic clustering. It then feeds them into large multimodal models (LMMs) for final coordinate prediction. Experiments on IM2GPS, IM2GPS3k, and YFCC4k show that DualGeo outperforms state‑of‑the‑art methods, improving street‑level (<1 km) and city‑level (<25 km) localization accuracy by 3.6%‑16.58% and 1.29%‑8.77%, respectively. Our code and datasets are available at https://github.com/CJ310177/DualGeo.
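The street-level (<1 km) and city-level (<25 km) figures above are threshold accuracies over great-circle error. A minimal sketch of how such a metric is commonly computed follows; it assumes the haversine formula with a mean Earth radius of 6371 km, and the function names are illustrative, not taken from the DualGeo code.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a common convention, assumed here

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against rounding just above the asin domain.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))

def accuracy_at(preds, truths, threshold_km):
    """Fraction of predictions whose error is below threshold_km."""
    hits = sum(1 for p, t in zip(preds, truths)
               if haversine_km(p[0], p[1], t[0], t[1]) < threshold_km)
    return hits / len(truths)

if __name__ == "__main__":
    preds = [(48.8566, 2.3522), (40.0, -74.0)]
    truths = [(48.8570, 2.3530), (40.7128, -74.0060)]
    for km in (1, 25, 200):
        print(f"acc@{km}km = {accuracy_at(preds, truths, km):.2f}")
```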

Abstract:
Shadows are a prevalent problem in remote sensing imagery (RSI), degrading visual quality and severely limiting the performance of downstream tasks like object detection and semantic segmentation. Most prior works treat shadow detection and removal as separate, cascaded tasks, which can lead to a cumbersome process and error accumulation. Furthermore, many deep learning methods rely on paired shadow and non‑shadow images for training, which are often unavailable in practice. To address these challenges, we propose the Shadow‑Aware and Removal Unified (SARU) Framework, a cohesive two‑stage framework. First, its dual‑branch detection module (DBCSF‑Net) fuses multi‑color space and semantic features to generate high‑fidelity shadow masks, effectively distinguishing shadows from dark objects. Then, leveraging these masks, a novel, training‑free physical algorithm (N^2SGSR) restores illumination by transferring properties from adjacent non‑shadow regions within the single input image. To facilitate rigorous evaluation and foster future work, we also introduce two new benchmark datasets: the RSI Shadow Detection (RSISD) dataset and the Single‑image Shadow Removal Benchmark (SiSRB). Extensive experiments on the AISD and RSISD datasets demonstrate that SARU achieves SOTA shadow detection performance. For shadow removal, our training‑free N^2SGSR algorithm attains an average processing speed of approximately 1.3 s, which is over 10 times faster than the SOTA MAOSD, while maintaining an SRI value close to 0.9 on both the AISD and SiSRB datasets, a level comparable to the advanced RS‑GSSR method. By holistically integrating shadow detection and removal to mitigate error propagation and eliminating the dependency on paired training data, SARU establishes a robust, practical framework for real‑world RSI analysis. The code and datasets are publicly available at: https://github.com/AeroVILab‑AHU/SARU

Abstract:
In pathology, the spatial distribution and proportions of tissue types are key indicators of disease progression, and are more readily available than fine‑grained annotations. However, these assessments are rarely mapped to pixel‑wise segmentation. The task is fundamentally underdetermined, as many spatially distinct segmentations can satisfy the same global proportions in the absence of pixel‑wise constraints. To address this, we introduce Variational Segmentation from Label Proportions (VSLP), a two‑stage framework that infers dense segmentations from global label proportions, without any pixel‑level annotations. This framework first leverages a pre‑trained transformer model with test‑time augmentation to produce pixel‑wise confidence estimates. In the second stage, these estimates are fused by solving a variational optimization problem that incorporates a Wasserstein data fidelity term alongside a learned regularizer. Unlike end‑to‑end networks, our variational method can visualize the fidelity‑regularization energy, resulting in more interpretable segmentation. We validate our approach on two public datasets, achieving superior performance over existing weakly supervised and unsupervised methods. For one of these datasets, proportions have been estimated by an experienced pathologist to provide a realistic benchmark to the community. Furthermore, the method scales to an in‑house dataset with noisy pathologist labels, substantially outperforming state‑of‑the‑art methods, thereby demonstrating practical applicability. The code and data will be made publicly available upon acceptance at https://github.com/xiaoliangpi/VSLP.

Abstract:
Semantic segmentation of multi‑modal remote sensing imagery plays a pivotal role in land use/land cover (LULC) mapping, environmental monitoring, and precision earth observation. Current multi‑modal approaches mainly focus on integrating complementary visual modalities, yet neglect the incorporation of non‑visual textual data, a rich source of knowledge that can bridge semantic gaps between visual patterns and real‑world concepts. To address this limitation, we propose TSMNet, a text‑supervised multi‑modal open‑vocabulary semantic segmentation network that synergistically integrates textual supervision with visual representation. Unlike conventional multi‑modal segmentation frameworks, TSMNet introduces a dual‑branch text encoder to extract both scene‑level semantic and object‑level label information from various textual data, enabling dynamic cross‑modal fusion. These text‑derived features interact with visual embeddings through the proposed text‑guided visual semantic fusion module, enabling domain‑aware feature refinement and human‑interpretable decision‑making. To verify our method, we construct two new multi‑modal datasets and carry out extensive experiments comparing the proposed method with other state‑of‑the‑art (SOTA) semantic segmentation models. Results demonstrate that TSMNet achieves superior segmentation accuracy while exhibiting robust generalization capabilities across diverse geographical and sensor‑specific scenarios. This work establishes a new paradigm for explainable remote sensing analysis, demonstrating that textual knowledge integration significantly enhances model generalizability. The source code will be available at https://github.com/yeyuanxin110/TSMNet

Abstract:
Understanding the surrounding environment is fundamental in autonomous driving and robotic perception. Distinguishing between known classes and previously unseen objects, a task known as anomaly segmentation, is crucial in real‑world environments. However, research in the 3D field remains limited, with most existing approaches applying post‑processing techniques from 2D vision. To address this shortcoming, we propose a new efficient approach that directly operates in the feature space, modeling the feature distribution of inlier classes to constrain anomalous samples. Moreover, the only publicly available 3D LiDAR anomaly segmentation dataset contains simple scenarios, with few anomaly instances, and exhibits a severe domain gap due to its sensor resolution. To bridge this gap, we introduce a set of mixed real‑synthetic datasets for 3D LiDAR anomaly segmentation, built upon established semantic segmentation benchmarks, with multiple out‑of‑distribution objects and diverse, complex environments. Extensive experiments demonstrate that our approach achieves state‑of‑the‑art and competitive results on the existing real‑world dataset and the newly introduced mixed datasets, respectively, validating the effectiveness of our method and the utility of the proposed datasets. Code and datasets are available at https://simom0.github.io/lido‑page/.

Abstract:
High‑performance semantic segmentation has achieved significant progress in recent years, often driven by increasingly large backbones and higher computational budgets. While effective, such approaches introduce substantial computational overhead and limit accessibility under constrained hardware settings. In this paper, we propose DGM‑Net (Directional Geometric Mamba Network), an efficient architecture that improves modeling capability through structural design rather than increasing model capacity. We introduce Directional Geometric Mamba (G‑Mamba), a linear‑complexity O(N) operator as an alternative to conventional context modeling modules such as ASPP and PPM. To further enhance structural awareness in state space model (SSM)‑based modeling, we design the DGM‑Module, which extracts centripetal flow fields and topological skeletons to guide the scanning process and improve boundary preservation. Without relying on large‑scale pretraining or heavy backbone scaling, DGM‑Net achieves 80.8% mIoU within 28k iterations, 82.3% mIoU on Cityscapes test set, and 45.24% mIoU on ADE20K. In addition, the model maintains stable performance under constrained hardware settings (e.g., batch size of 2 on 8GB VRAM), highlighting its efficiency and practicality. These results demonstrate that incorporating geometric guidance into SSM‑based architectures provides an effective and resource‑efficient direction for semantic segmentation.

Abstract:
Reliable object perception is necessary for general‑purpose service robots. Open‑vocabulary detectors struggle to generalize beyond a few classes and fully supervised training of object detectors requires time‑intensive annotations. We present a semi‑supervised label propagation approach for household object segmentation. A segment proposer generates class‑agnostic masks, and an ensemble of Hopfield networks assigns labels by learning representative embeddings in complementary foundation model embedding spaces (CLIP, ViT, Theia). Our approach scales to 50 object classes with limited annotation overhead and can automatically label 60% of the data in a RoboCup@Home setting, where preparation time is severely constrained. Dataset and code are publicly available at https://github.com/ais‑bonn/label_propagation.

Abstract:
With the accumulation of resources in the era of big data and the rise of pre‑trained models in deep learning, optimizing neural networks for various tasks often involves different strategies for fine‑tuning pre‑trained models versus training from scratch. However, existing optimizers primarily focus on reducing the loss function by updating model parameters, without fully addressing the unique demands of these two major paradigms. In this paper, we propose DualOpt, a novel approach that decouples optimization techniques specifically tailored for these distinct training scenarios. For training from scratch, we introduce real‑time layer‑wise weight decay, designed to enhance both convergence and generalization by aligning with the characteristics of weight updates and network architecture. More importantly, for fine‑tuning, we integrate weight rollback with the optimizer, incorporating a rollback term into each weight update step. This ensures consistency in the weight distribution between upstream and downstream models, effectively mitigating knowledge forgetting and improving fine‑tuning performance. Additionally, we extend the layer‑wise weight decay to dynamically adjust the rollback levels across layers, adapting to the varying demands of different downstream tasks. Extensive experiments across diverse tasks, including image classification, object detection, semantic segmentation, and instance segmentation, demonstrate the broad applicability and state‑of‑the‑art performance of DualOpt. Code is available at https://github.com/qklee‑lz/OLOR‑AAAI‑2024.

Abstract:
We propose Semantic‑Fast‑SAM (SFS), a semantic segmentation framework that combines the Fast Segment Anything model with a semantic labeling pipeline to achieve real‑time performance without sacrificing accuracy. FastSAM is an efficient CNN‑based re‑implementation of the Segment Anything Model (SAM) that runs much faster than the original transformer‑based SAM. Building upon FastSAM's rapid mask generation, we integrate a Semantic‑Segment‑Anything (SSA) labeling strategy to assign meaningful categories to each mask. The resulting SFS model produces high‑quality semantic segmentation maps at a fraction of the computational cost and memory footprint of the original SAM‑based approach. Experiments on Cityscapes and ADE20K benchmarks demonstrate that SFS matches the accuracy of prior SAM‑based methods (mIoU ~ 70.33 on Cityscapes and 48.01 on ADE20K) while achieving approximately 20x faster inference than SSA in the closed‑set setting. We also show that SFS effectively handles open‑vocabulary segmentation by leveraging CLIP‑based semantic heads, outperforming recent open‑vocabulary models on broad class labeling. This work enables practical real‑time semantic segmentation with the "segment‑anything" capability, broadening the applicability of foundation segmentation models in robotics scenarios. The implementation is available at https://github.com/KBH00/Semantic‑Fast‑SAM.

Abstract:
Extracting standardized metallurgical metrics from microscopy images remains challenging due to complex grain morphology and the data demands of supervised segmentation. To bridge foundational computer vision with practical metallurgical evaluation, we propose an automated pipeline for dense instance segmentation and grain size estimation that adapts Cellpose‑SAM to microstructures and integrates its topology‑aware gradient tracking with an ASTM E112 Jeffries planimetric module. We systematically benchmark this pipeline against a classical convolutional network (U‑Net), an adaptive‑prompting vision foundation model (MatSAM) and a contemporary vision‑language model (Qwen2.5‑VL‑7B). Our evaluations reveal that while the out‑of‑the‑box vision‑language model struggles with the localized spatial reasoning required for dense microscopic counting and MatSAM suffers from over‑segmentation despite its domain‑specific prompt generation, our adapted pipeline successfully maintains topological separation. Furthermore, experiments across progressively reduced training splits demonstrate exceptional few‑shot scalability; utilizing only two training samples, the proposed system predicts the ASTM grain size number (G) with a mean absolute percentage error (MAPE) as low as 1.50%, while robustness testing across varying target grain counts empirically validates the ASTM 50‑grain sampling minimum. These results highlight the efficacy of application‑level foundation model integration for highly accurate, automated materials characterization. Our project repository is available at https://github.com/mueez‑overflow/ASTM‑Grain‑Size‑Estimator.
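For readers unfamiliar with the planimetric step, the following sketch shows how a grain count can be turned into a grain size number G and how MAPE is computed. It uses the Jeffries multiplier f = M^2 / A and the relation G = 3.321928 log10(N_A) - 2.954 (N_A in grains per mm^2 at 1x) as commonly quoted from ASTM E112; these constants are stated from memory and should be checked against the standard, and the function names are illustrative rather than taken from the authors' repository.

```python
import math

def jeffries_grains_per_mm2(n_inside, n_intercepted, magnification, test_area_mm2=5000.0):
    """Planimetric (Jeffries) estimate of grains per mm^2 at 1x.

    n_inside: grains entirely inside the test figure.
    n_intercepted: grains cut by the test figure boundary (counted as halves).
    test_area_mm2: area of the test figure on the magnified image (assumed 5000 mm^2).
    """
    f = magnification ** 2 / test_area_mm2  # Jeffries multiplier
    return f * (n_inside + 0.5 * n_intercepted)

def astm_grain_size_number(n_a):
    """ASTM grain size number G from grains per mm^2 (constants as commonly quoted)."""
    return 3.321928 * math.log10(n_a) - 2.954

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

if __name__ == "__main__":
    n_a = jeffries_grains_per_mm2(n_inside=40, n_intercepted=20, magnification=100)
    print("N_A:", n_a, "G:", round(astm_grain_size_number(n_a), 3))
```

In a segmentation-based pipeline, `n_inside` and `n_intercepted` would come from counting predicted instance masks that fall inside or touch the boundary of the test region.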

Abstract:
Despite recent progress, vision‑language encoders struggle with two core limitations: (1) weak alignment between language and dense vision features, which hurts tasks like open‑vocabulary semantic segmentation; and (2) high token counts for fine‑grained visual representations, which limits scalability to long videos. This work addresses both limitations. We propose T‑REN (Text‑aligned Region Encoder Network), an efficient encoder that maps visual data to a compact set of text‑aligned region‑level representations (or region tokens). T‑REN achieves this through a lightweight network added on top of a frozen vision backbone, trained to pool patch‑level representations within each semantic region into region tokens and align them with region‑level text annotations. With only 3.7% additional parameters compared to the vision‑language backbone, this design yields substantially stronger dense cross‑modal understanding while reducing the token count by orders of magnitude. Specifically, T‑REN delivers +5.9 mIoU on ADE20K open‑vocabulary segmentation, +18.4% recall on COCO object‑level text‑image retrieval, +15.6% recall on Ego4D video object localization, and +17.6% mIoU on VSPW video scene parsing, all while reducing token counts by more than 24x for images and 187x for videos compared to the patch‑based vision‑language backbone. The code and model are available at https://github.com/savya08/T‑REN.
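The token reduction comes from replacing many patch tokens with one token per semantic region. The sketch below illustrates the idea with plain mean pooling over patches that share a region id; T-REN's pooling is learned, so mean pooling and the names used here are simplifying assumptions, not the paper's mechanism.

```python
def pool_region_tokens(patch_features, region_ids):
    """Average the patch feature vectors that share a region id.

    patch_features: list of equal-length feature vectors, one per patch.
    region_ids: list of hashable region labels, one per patch.
    Returns a dict mapping region id -> pooled feature vector.
    """
    sums = {}
    counts = {}
    for feat, rid in zip(patch_features, region_ids):
        if rid not in sums:
            sums[rid] = [0.0] * len(feat)
            counts[rid] = 0
        sums[rid] = [s + f for s, f in zip(sums[rid], feat)]
        counts[rid] += 1
    return {rid: [s / counts[rid] for s in total] for rid, total in sums.items()}

if __name__ == "__main__":
    patches = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0], [0.0, 6.0]]
    regions = [0, 0, 1, 1, 1]
    tokens = pool_region_tokens(patches, regions)
    print(tokens)
    print("token reduction factor:", len(patches) / len(tokens))
```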

Abstract:
Mixture‑of‑Experts (MoE) models provide a structured approach to combining specialized neural networks and offer greater interpretability than conventional ensembles. While MoEs have been successfully applied to image classification and semantic segmentation, their use in object detection remains limited due to challenges in merging dense and structured predictions. In this work, we investigate model‑level mixtures of object detectors and analyze their suitability for improving performance and interpretability in object detection. We propose an MoE architecture that combines YOLO‑based detectors trained on semantically disjoint data subsets, with a learned gating network that dynamically weights expert contributions. We study different strategies for fusing detection outputs and for training the gating mechanism, including balancing losses to prevent expert collapse. Experiments on the BDD100K dataset demonstrate that the proposed MoE consistently outperforms standard ensemble approaches and provides insights into expert specialization across domains, highlighting model‑level MoEs as a viable alternative to traditional ensembling for object detection. Our code is available at https://github.com/KASTEL‑MobilityLab/mixtures‑of‑experts/.

Abstract:
Referring video object segmentation (RVOS) aims to segment the target instance in a video, referred to by a text expression. Conventional approaches mostly rely on supervised learning, requiring expensive pixel‑level mask annotations. To address this, weakly‑supervised RVOS has recently been proposed to replace mask annotations with bounding boxes or points, which are, however, still costly and labor‑intensive. In this paper, we design a novel weakly‑supervised RVOS method, namely WSRVOS, to train the model with only text expressions. Given an input video and the referring expression, we first design a contrastive referring expression augmentation scheme that leverages the captioning capabilities of a multimodal large language model to generate both positive and negative expressions. We extract visual and linguistic features from the input video and generated expressions, then perform bi‑directional vision‑language feature selection and interaction to enable fine‑grained multimodal alignment. Next, we propose an instance‑aware expression classification scheme to optimize the model in distinguishing positive from negative expressions. Also, we introduce a positive‑prediction fusion strategy to generate high‑quality pseudo‑masks, which serve as additional supervision to the model. Last, we design a temporal segment ranking constraint such that the overlaps between mask predictions of temporally neighboring frames are required to conform to specific orders. Extensive experiments on four publicly available RVOS datasets, including A2D Sentences, J‑HMDB Sentences, Ref‑YouTube‑VOS, and Ref‑DAVIS17, demonstrate the superiority of our method. Code is available at https://github.com/viscom‑tongji/WSRVOS.

Abstract:
Semantic segmentation in hyperbolic space enables compact modeling of hierarchical structure while providing inherent uncertainty quantification. Prior approaches predominantly rely on the Poincaré ball model, which suffers from numerical instability as well as optimization and computational challenges. We propose a novel, tractable, architecture‑agnostic semantic segmentation framework (pixel‑wise and mask classification) in the hyperbolic Lorentz model. We employ text embeddings with semantic and visual cues to guide hierarchical pixel‑level representations in Lorentz space. This enables stable and efficient optimization without requiring a Riemannian optimizer, and easily integrates with existing Euclidean architectures. Beyond segmentation, our approach yields free uncertainty estimation, confidence maps, boundary delineation, hierarchical and text‑based retrieval, and zero‑shot performance, reaching generalized flatter minima. We introduce a novel uncertainty and confidence indicator in Lorentz cone embeddings. Further, we provide analytical and empirical insights into Lorentz optimization via gradient analysis. Extensive experiments on ADE20K, COCO‑Stuff‑164k, Pascal‑VOC, and Cityscapes, utilizing state‑of‑the‑art per‑pixel classification models (DeepLabV3 and SegFormer) and mask classification models (Mask2Former and MaskFormer), validate the effectiveness and generality of our approach. Our results demonstrate the potential of hyperbolic Lorentz embeddings for robust and uncertainty‑aware semantic segmentation. Code is available at https://github.com/mxahan/Lorentz_semantic_segmentation.
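As background for the Lorentz model mentioned above, the sketch below shows the textbook operations on the unit hyperboloid (curvature -1): lifting a Euclidean vector onto the hyperboloid, the Lorentzian inner product, and the geodesic distance. These are the standard definitions and are not claimed to be the exact parameterization used in the paper; the function names are illustrative.

```python
import math

def lift_to_hyperboloid(v):
    """Map a Euclidean vector v to the point (x0, v) with x0 = sqrt(1 + ||v||^2),
    which satisfies <x, x>_L = -1."""
    x0 = math.sqrt(1.0 + sum(c * c for c in v))
    return [x0] + list(v)

def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def lorentz_distance(x, y):
    """Geodesic distance on the unit hyperboloid: arccosh(-<x, y>_L).
    The argument is clamped to >= 1 to absorb floating-point rounding."""
    return math.acosh(max(1.0, -lorentz_inner(x, y)))

if __name__ == "__main__":
    origin = lift_to_hyperboloid([0.0, 0.0])
    point = lift_to_hyperboloid([3.0, 4.0])
    print("on-manifold check:", lorentz_inner(point, point))
    print("distance from origin:", lorentz_distance(origin, point))
```

Because the time-like coordinate x0 is computed from the spatial part, a Euclidean network output can be lifted without any constraint on its norm, which is one reason the Lorentz model is often described as numerically friendlier than the Poincaré ball.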

Abstract:
Multimodal remote sensing data provide complementary information for semantic segmentation, but in real‑world deployments, some modalities may be unavailable due to sensor failures, acquisition issues, or challenging atmospheric conditions. Existing multimodal segmentation models typically address missing modalities by learning a shared representation across inputs. However, this approach can introduce a trade‑off by compromising modality‑specific complementary information and reducing performance when all modalities are available. In this paper, we tackle this limitation with CBC‑SLP, a multimodal semantic segmentation model designed to preserve both modality‑invariant and modality‑specific information. Inspired by the theoretical results on modality alignment, which state that perfectly aligned multimodal representations can lead to sub‑optimal performance in downstream prediction tasks, we propose a novel structured latent projection approach as an architectural inductive bias. Rather than enforcing this strategy through a loss term, we incorporate it directly into the architecture. In particular, to use the complementary information effectively while maintaining robustness under random modality dropout, we structure the latent representations into shared and modality‑specific components and adaptively transfer them to the decoder according to the random modality availability mask. Extensive experiments on three multimodal remote sensing image sets demonstrate that CBC‑SLP consistently outperforms state‑of‑the‑art multimodal models across full and missing modality scenarios. Besides, we empirically demonstrate that the proposed strategy can recover the complementary information that may not be preserved in a shared representation. The code is available at https://github.com/iremulku/Multispectral‑Semantic‑Segmentation‑via‑Structured‑Latent‑Projection‑CBC‑SLP‑.

Abstract:
Recent advances in 3D Gaussian Splatting (3DGS) have enabled highly efficient and photorealistic novel view synthesis. However, segmenting objects accurately in 3DGS remains challenging due to the discrete nature of Gaussian representations, which often leads to aliasing and artifacts at object boundaries. In this paper, we introduce NG‑GS, a novel framework for high‑quality object segmentation in 3DGS that explicitly addresses boundary discretization. Our approach begins by automatically identifying ambiguous Gaussians at object boundaries using mask variance analysis. We then apply radial basis function (RBF) interpolation to construct a spatially continuous feature field, enhanced by multi‑resolution hash encoding for efficient multi‑scale representation. A joint optimization strategy aligns 3DGS with a lightweight NeRF module through alignment and spatial continuity losses, ensuring smooth and consistent segmentation boundaries. Extensive experiments on NVOS, LERF‑OVS, and ScanNet benchmarks demonstrate that our method achieves state‑of‑the‑art performance, with significant gains in boundary mIoU. Code is available at https://github.com/BJTU‑KD3D/NG‑GS.

Abstract:
Sparse mixture‑of‑experts (MoE) layers have been shown to substantially increase model capacity without a proportional increase in computational cost and are widely used in transformer architectures, where they typically replace feed‑forward network blocks. In contrast, integrating sparse MoE layers into convolutional neural networks (CNNs) remains inconsistent, with most prior work focusing on fine‑grained MoEs operating at the filter or channel levels. In this work, we investigate a coarser, patch‑wise formulation of sparse MoE layers for semantic segmentation, where local regions are routed to a small subset of convolutional experts. Through experiments on the Cityscapes and BDD100K datasets using encoder‑decoder and backbone‑based CNNs, we conduct a design analysis to assess how architectural choices affect routing dynamics and expert specialization. Our results demonstrate consistent, architecture‑dependent improvements (up to +3.9 mIoU) with little computational overhead, while revealing strong design sensitivity. Our work provides empirical insights into the design and internal dynamics of sparse MoE layers in CNN‑based dense prediction. Our code is available at https://github.com/KASTEL‑MobilityLab/moe‑layers/.
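
The patch-wise routing idea can be illustrated with a small sketch: each local patch gets a descriptor, a linear router scores the experts, and only the top-k experts are evaluated and combined. This is a toy under stated assumptions (flat lists instead of feature maps, plain callables instead of convolutional experts, softmax gates renormalised over the selected experts); the names `route_patch` and `moe_patch_output` are hypothetical and the paper's actual design may differ.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_patch(patch_descriptor, router_weights, k=2):
    """Return [(expert_index, gate_weight)] for the top-k experts of one patch."""
    logits = [sum(w * f for w, f in zip(row, patch_descriptor))
              for row in router_weights]
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    # Renormalise so the gates of the selected experts sum to one.
    return [(i, probs[i] / norm) for i in top]

def moe_patch_output(patch, patch_descriptor, experts, router_weights, k=2):
    """Gate-weighted sum of the selected experts' outputs for one patch.

    Experts that are not selected are never called, which is where the
    compute saving of a sparse layer comes from.
    """
    routes = route_patch(patch_descriptor, router_weights, k)
    outputs = [(gate, experts[i](patch)) for i, gate in routes]
    return [sum(gate * out[j] for gate, out in outputs)
            for j in range(len(patch))]
```

In a real CNN the patch would be a spatial window of a feature map and each expert a small convolutional block; the routing logic stays the same.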

Abstract:
Instance‑level object segmentation across disparate egocentric and exocentric views is a fundamental challenge in visual understanding, critical for applications in embodied AI and remote collaboration. This task is exceptionally difficult due to severe changes in scale, perspective, and occlusion, which destabilize direct pixel‑level matching. While recent geometry‑aware models like VGGT provide a strong foundation for feature alignment, we find they often fail at dense prediction tasks due to significant pixel‑level projection drift, even when their internal object‑level attention remains consistent. To bridge this gap, we introduce VGGT‑Segmentor (VGGT‑S), a framework that unifies robust geometric modeling with pixel‑accurate semantic segmentation. VGGT‑S leverages VGGT's powerful cross‑view feature representation and introduces a novel Union Segmentation Head. This head operates in three stages: mask prompt fusion, point‑guided prediction, and iterative mask refinement, effectively translating high‑level feature alignment into a precise segmentation mask. Furthermore, we propose a single‑image self‑supervised training strategy that eliminates the need for paired annotations and enables strong generalization. On the Ego‑Exo4D benchmark, VGGT‑S sets a new state‑of‑the‑art, achieving 67.7% and 68.0% average IoU for Ego to Exo and Exo to Ego tasks, respectively, significantly outperforming prior methods. Notably, our correspondence‑free pretrained model surpasses most fully‑supervised baselines, demonstrating the effectiveness and scalability of our approach. Code is publicly available at: https://github.com/buaa‑colalab/VGGT‑S.

Abstract:
The robustness of machine learning models can be compromised by spurious correlations between non‑causal features in the input data and target labels. A common way to test for such correlations is to train on data where the label is strongly tied to some non‑causal cue, then evaluate on examples where that tie no longer holds. This idea is well established for classification tasks, but for semantic segmentation the specific failure modes are not well understood. We show that a model may achieve reasonable overlap while assigning the wrong semantic label, swapping one plausible foreground class for another, even when object boundaries are largely correct. We focus on this semantic label‑flip behaviour and quantify it with a simple diagnostic (Flip) that counts how often ground truth foreground pixels are assigned the wrong foreground identity while remaining predicted as foreground. In a setting where category and scene are correlated during training, increasing the correlation consistently widens the gap between common and rare test conditions and increases these within‑object label swaps on counterfactual groups. Overall, our results motivate assessing segmentation robustness under distribution shift beyond overlap by decomposing foreground errors into correct pixels, flipped‑identity pixels, and missed‑to‑background pixels. We also propose an entropy‑based, ground truth label‑free 'flip‑risk' score, which is computed from foreground identity uncertainty, and show that it can flag flip‑prone cases at inference time. Code is available at https://github.com/acharaakshit/label‑flips.
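
The three-way decomposition described above can be sketched directly from its definition. The code below is an illustration under assumptions: label maps are flattened lists of integer class ids, class 0 is background, and fractions are taken over ground-truth foreground pixels (the paper's exact normalisation is not stated in the abstract). The `foreground_entropy` helper is a hypothetical stand-in for an entropy-based score over foreground identities, not the authors' 'flip‑risk' definition.

```python
import math

BACKGROUND = 0  # assumed background class id

def foreground_decomposition(gt, pred, background=BACKGROUND):
    """Split ground-truth foreground pixels into correct / flipped / missed.

    correct: predicted class equals the ground-truth class
    flipped: predicted as foreground, but with the wrong class identity
    missed:  predicted as background
    """
    counts = {"correct": 0, "flipped": 0, "missed": 0}
    for g, p in zip(gt, pred):
        if g == background:
            continue  # only ground-truth foreground pixels are considered
        if p == g:
            counts["correct"] += 1
        elif p == background:
            counts["missed"] += 1
        else:
            counts["flipped"] += 1
    total = sum(counts.values())
    return {k: (v / total if total else 0.0) for k, v in counts.items()}

def foreground_entropy(class_probs, background=BACKGROUND):
    """Entropy of the class distribution renormalised over foreground classes."""
    fg = [p for c, p in enumerate(class_probs) if c != background]
    z = sum(fg)
    if z == 0:
        return 0.0
    return -sum((p / z) * math.log(p / z) for p in fg if p > 0)
```

A prediction can score well on foreground-versus-background overlap while having a large `flipped` fraction, which is the failure mode the abstract highlights.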

Abstract:
4D point cloud videos capture rich spatial and temporal dynamics of scenes, which makes them uniquely valuable for various 4D understanding tasks. However, most existing methods work in the spatiotemporal domain, where the underlying geometric characteristics of 4D point cloud videos are hard to capture, leading to degraded representation learning and understanding of 4D point cloud videos. We address this challenge from a complementary spectral perspective. By transforming 4D point cloud videos into graph spectral signals, we can decompose them into multiple frequency bands, each of which captures distinct geometric structures of point cloud videos. Our spectral analysis reveals that the decomposed low‑frequency signals capture coarse shapes, while high‑frequency signals encode fine‑grained geometric details. Building on these observations, we design Spatio‑Temporal‑Spectral Mixer (STS‑Mixer), a unified framework that mixes spatial, temporal, and spectral representations of point cloud videos. STS‑Mixer integrates multi‑band delineated spectral signals with spatiotemporal information to capture rich geometries and temporal dynamics, while enabling fine‑grained and holistic understanding of 4D point cloud videos. Extensive experiments show that STS‑Mixer consistently achieves superior performance across multiple widely adopted benchmarks on both 3D action recognition and 4D semantic segmentation tasks. Code and models are available at https://github.com/Vegetebird/STS‑Mixer.

Abstract:
Recent semantic 3D Gaussian Splatting (3DGS) methods primarily rely on 2D foundation models, often yielding ambiguous boundaries and limited support for structured urban semantics. While city models such as CityGML encode hierarchically organized semantics together with building geometry, these labels cannot be directly mapped to Gaussian primitives. We present GS4City, a hierarchical semantic Gaussian Splatting method that incorporates city‑model priors for urban scene understanding. GS4City derives reliable image‑aligned masks from Level of Detail (LoD) 3 CityGML models via two‑pass raycasting, explicitly using parent‑child relations to validate and recover fine‑grained facade elements. It then fuses these geometry‑grounded masks with foundation‑model predictions to establish scene‑consistent instance correspondences, and learns a compact identity encoding for each Gaussian under joint 2D identity supervision and 3D spatial regularization. Experiments on the TUM2TWIN and Gold Coast datasets show that GS4City effectively incorporates structured building semantics into Gaussian scene representations, outperforming existing 2D‑driven semantic 3DGS baselines, including LangSplat and Gaga, by up to 15.8 IoU points in coarse building segmentation and 14.2 mIoU points in fine‑grained semantic segmentation. By bridging structured city models and photorealistic Gaussian scene representations, GS4City enables semantically queryable and structure‑aware urban reconstruction. Code is available at https://github.com/Jinyzzz/GS4City.

Abstract:
Change detection is a fundamental task in remote sensing, aiming to quantify the impacts of human activities and ecological dynamics on land‑cover changes. Existing change detection methods are limited to predefined classes in training datasets, which constrains their scalability in real‑world scenarios. In recent years, numerous advanced open‑vocabulary semantic segmentation models have emerged for remote sensing imagery. However, there is still no effective framework for directly applying these models to open‑vocabulary change detection (OVCD), a novel task that integrates vision and language to detect changes across arbitrary categories. To address these challenges, we first construct a category‑agnostic change detection dataset, termed CA‑CDD. Further, we design a category‑agnostic change head to detect the transitions of arbitrary categories and index them to specific classes. Building on these, we propose Seg2Change, an adapter designed to adapt open‑vocabulary semantic segmentation models to the change detection task. Without bells and whistles, this simple yet effective framework achieves state‑of‑the‑art OVCD performance (+9.52 IoU on WHU‑CD and +5.50 mIoU on SECOND). Our code is released at https://github.com/yogurts‑sy/Seg2Change.

Abstract:
Three‑dimensional (3D) point cloud analysis has become central to applications ranging from autonomous driving and robotics to forestry and ecological monitoring. Although numerous deep learning methods have been proposed for point cloud understanding, including supervised backbones, self‑supervised pre‑training (SSL), and parameter‑efficient fine‑tuning (PEFT), their implementations are scattered across incompatible codebases with differing data pipelines, evaluation protocols, and configuration formats, making fair comparisons difficult. We introduce LIDARLearn, a unified, extensible PyTorch library that integrates over 55 model configurations covering 29 supervised architectures, seven SSL pre‑training methods, and five PEFT strategies, all within a single registry‑based framework supporting classification, semantic segmentation, part segmentation, and few‑shot learning. LIDARLearn provides standardised training runners, cross‑validation with stratified K‑fold splitting, automated LaTeX/CSV table generation, built‑in Friedman/Nemenyi statistical testing with critical‑difference diagrams for rigorous multi‑model comparison, and a comprehensive test suite with 2,200+ automated tests validating every configuration end‑to‑end. The code is available at https://github.com/said‑ohamouddou/LIDARLearn under the MIT licence.

Abstract:
Recent advances in vision foundation models have revolutionized geometry reconstruction and semantic understanding. Yet, most of the existing approaches treat these capabilities in isolation, leading to redundant pipelines and compounded errors. This paper introduces FF3R, a fully annotation‑free feed‑forward framework that unifies geometric and semantic reasoning from unconstrained multi‑view image sequences. Unlike previous methods, FF3R does not require camera poses, depth maps, or semantic labels, relying solely on rendering supervision for RGB and feature maps, establishing a scalable paradigm for unified 3D reasoning. In addition, we address two critical challenges in feedforward feature reconstruction pipelines, namely global semantic inconsistency and local structural inconsistency, through two key innovations: (i) a Token‑wise Fusion Module that enriches geometry tokens with semantic context via cross‑attention, and (ii) a Semantic‑Geometry Mutual Boosting mechanism combining geometry‑guided feature warping for global consistency with semantic‑aware voxelization for local coherence. Extensive experiments on ScanNet and DL3DV‑10K demonstrate FF3R's superior performance in novel‑view synthesis, open‑vocabulary semantic segmentation, and depth estimation, with strong generalization to in‑the‑wild scenarios, paving the way for embodied intelligence systems that demand both spatial and semantic understanding.

Abstract:
Semi‑supervised semantic segmentation in computational pathology remains challenging due to scarce pixel‑level annotations and unreliable pseudo‑label supervision. We propose UniSemAlign, a dual‑modal semantic alignment framework that enhances visual segmentation by injecting explicit class‑level structure into pixel‑wise learning. Built upon a pathology‑pretrained Transformer encoder, UniSemAlign introduces complementary prototype‑level and text‑level alignment branches in a shared embedding space, providing structured guidance that reduces class ambiguity and stabilizes pseudo‑label refinement. The aligned representations are fused with visual predictions to generate more reliable supervision for unlabeled histopathology images. The framework is trained end‑to‑end with supervised segmentation, cross‑view consistency, and cross‑modal alignment objectives. Extensive experiments on the GlaS and CRAG datasets demonstrate that UniSemAlign substantially outperforms recent semi‑supervised baselines under limited supervision, achieving Dice improvements of up to 2.6% on GlaS and 8.6% on CRAG with only 10% labeled data, and strong improvements at 20% supervision. Code is available at: https://github.com/thailevann/UniSemAlign

Abstract:
In this work, we present LIANet (Location Is All You Need Network), a coordinate‑based neural representation that models multi‑temporal spaceborne Earth observation (EO) data for a given region of interest as a continuous spatiotemporal neural field. Given only spatial and temporal coordinates, LIANet reconstructs the corresponding satellite imagery. Once pretrained, this neural representation can be adapted to various EO downstream tasks, such as semantic segmentation or pixel‑wise regression, and, importantly, this adaptation does not require access to the original satellite data. LIANet is intended to serve as a user‑friendly alternative to Geospatial Foundation Models (GFMs) by eliminating the overhead of data access and preprocessing for end‑users and enabling fine‑tuning solely based on labels. We demonstrate the pretraining of LIANet across target areas of varying sizes and show that fine‑tuning it for downstream tasks achieves competitive performance compared to training from scratch or using established GFMs. The source code and datasets are publicly available at https://github.com/mojganmadadi/LIANet/tree/v1.0.1.

Abstract:
Weakly supervised semantic segmentation aims to achieve pixel‑level predictions using image‑level labels. Existing methods typically entangle semantic recognition and object localization, which often leads models to focus exclusively on sparse discriminative regions. Although foundation models show immense potential, many approaches still follow the tightly coupled optimization paradigm, struggling to effectively alleviate pseudo‑label noise and often relying on time‑consuming multi‑stage retraining or unstable end‑to‑end joint optimization. To address the above challenges, we present ModuSeg, a training‑free weakly supervised semantic segmentation framework centered on explicitly decoupling object discovery and semantic assignment. Specifically, we integrate a general mask proposer to extract geometric proposals with reliable boundaries, while leveraging semantic foundation models to construct an offline feature bank, transforming segmentation into a non‑parametric feature retrieval process. Furthermore, we propose semantic boundary purification and soft‑masked feature aggregation strategies to effectively mitigate boundary ambiguity and quantization errors, thereby extracting high‑quality category prototypes. Extensive experiments demonstrate that the proposed decoupled architecture better preserves fine boundaries without parameter fine‑tuning and achieves highly competitive performance on standard benchmark datasets. Code is available at https://github.com/Autumnair007/ModuSeg.

Abstract:
Few‑Shot Semantic Segmentation (FSS) focuses on segmenting novel object categories from only a handful of annotated examples. Most existing approaches rely on extensive episodic training to learn transferable representations, which is both computationally demanding and sensitive to distribution shifts. In this work, we revisit FSS from the perspective of modern vision foundation models and explore the potential of Segment Anything Model 3 (SAM3) as a training‑free solution. By repurposing its Promptable Concept Segmentation (PCS) capability, we adopt a simple spatial concatenation strategy that places support and query images into a shared canvas, allowing a fully frozen SAM3 to perform segmentation without any fine‑tuning or architectural changes. Experiments on PASCAL‑5^i and COCO‑20^i show that this minimal design already achieves state‑of‑the‑art performance, outperforming many heavily engineered methods. Beyond empirical gains, we uncover that negative prompts can be counterproductive in few‑shot settings, where they often weaken target representations and lead to prediction collapse despite their intended role in suppressing distractors. These findings suggest that strong cross‑image reasoning can emerge from simple spatial formulations, while also highlighting limitations in how current foundation models handle conflicting prompt signals. Code at: https://github.com/WongKinYiu/FSS‑SAM3
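
The spatial concatenation step can be pictured with a small sketch that places a support image and a query image on one canvas and remembers where the query sits, so that a mask predicted on the canvas can be cropped back to the query. The side-by-side layout, the zero padding, and the helper names are assumptions made for illustration; the abstract only states that both images share a canvas, and the call to the frozen SAM3 model is deliberately left out.

```python
def concat_side_by_side(support, query, pad_value=0):
    """Place two 2D images (lists of rows) on one canvas: support left, query right.

    Returns the canvas and the query region as (top, left, bottom, right).
    The shorter image is padded at the bottom with pad_value.
    """
    height = max(len(support), len(query))
    w_support, w_query = len(support[0]), len(query[0])
    canvas = []
    for r in range(height):
        left = support[r] if r < len(support) else [pad_value] * w_support
        right = query[r] if r < len(query) else [pad_value] * w_query
        canvas.append(list(left) + list(right))
    query_box = (0, w_support, len(query), w_support + w_query)
    return canvas, query_box

def crop(mask, box):
    """Cut the (top, left, bottom, right) region out of a 2D mask."""
    top, left, bottom, right = box
    return [row[left:right] for row in mask[top:bottom]]
```

After the model segments the whole canvas, `crop(canvas_mask, query_box)` recovers the part of the prediction that belongs to the query image.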

Abstract:
Surgical video segmentation is fundamental to computer‑assisted surgery. In practice, surgeons need to dynamically specify targets throughout extended procedures, using heterogeneous cues such as visual selections, textual expressions, or audio instructions. However, existing Promptable Video Object Segmentation (PVOS) methods are typically restricted to a single prompt modality and rely on coupled frameworks that cause optimization interference between target initialization and tracking. Moreover, these methods produce hallucinated predictions when the target is absent and suffer from accumulated mask drift without failure recovery. To address these challenges, we present UniSurgSAM, a unified PVOS model enabling reliable surgical video segmentation through visual, textual, or audio prompts. Specifically, UniSurgSAM employs a decoupled two‑stage framework that independently optimizes initialization and tracking to resolve the optimization interference. Within this framework, we introduce three key designs for reliability: presence‑aware decoding that models target absence to suppress hallucinations; boundary‑aware long‑term tracking that prevents mask drift over extended sequences; and adaptive state transition that closes the loop between stages for failure recovery. Furthermore, we establish a multi‑modal and multi‑granular benchmark from four public surgical datasets with precise instance‑level masklets. Extensive experiments demonstrate that UniSurgSAM achieves state‑of‑the‑art performance in real time across all prompt modalities and granularities, providing a practical foundation for computer‑assisted surgery. Code and datasets will be available at https://jinlab‑imvr.github.io/UniSurgSAM.

Abstract:
Understanding human behaviour in crowded indoor environments is central to surveillance, smart buildings, and human‑robot interaction, yet existing datasets rarely capture real‑world indoor complexity at scale. We introduce IndoorCrowd, a multi‑scene dataset for indoor human detection, instance segmentation, and multi‑object tracking, collected across four campus locations (ACS‑EC, ACS‑EG, IE‑Central, R‑Central). It comprises 31 videos (9,913 frames at 5 fps) with human‑verified, per‑instance segmentation masks. A 620‑frame control subset benchmarks three foundation‑model auto‑annotators (SAM3, GroundingSAM, and EfficientGroundingSAM) against human labels using Cohen's κ, AP, precision, recall, and mask IoU. A further 2,552‑frame subset supports multi‑object tracking with continuous identity tracks in MOTChallenge format. We establish detection, segmentation, and tracking baselines using YOLOv8n, YOLOv26n, and RT‑DETR‑L paired with ByteTrack, BoT‑SORT, and OC‑SORT. Per‑scene analysis reveals substantial difficulty variation driven by crowd density, scale, and occlusion: ACS‑EC, with 79.3% dense frames and a mean instance scale of 60.8 px, is the most challenging scene. The project page is available at https://sheepseb.github.io/IndoorCrowd/.
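
Two of the agreement measures named above, mask IoU and Cohen's κ, have standard definitions that can be written in a few lines. The sketch below assumes flattened label sequences of equal length (for example, per-pixel binary masks from an auto-annotator and from a human); how the dataset authors aggregate these scores over instances and frames is not specified in the abstract.

```python
def mask_iou(a, b):
    """Intersection over union of two binary masks given as flat sequences."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return inter / union if union else 1.0  # two empty masks agree perfectly

def cohens_kappa(a, b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two label sequences.

    p_o is the observed agreement rate and p_e the agreement expected by
    chance from each annotator's label frequencies.
    """
    a, b = list(a), list(b)
    n = len(a)
    p_o = sum(1 for x, y in zip(a, b) if x == y) / n
    labels = set(a) | set(b)
    p_e = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1.0 - p_e)
```

Unlike raw agreement, κ discounts the agreement two annotators would reach by chance, which matters when background pixels dominate a frame.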

Abstract:
Volumetric video seeks to model dynamic scenes as temporally coherent 4D representations. While recent Gaussian‑based approaches achieve impressive rendering fidelity, they primarily emphasize appearance and are largely agnostic to instance‑level structure, limiting stable tracking and semantic reasoning in highly dynamic scenarios. In this paper, we present Director, a unified spatio‑temporal Gaussian representation that jointly models human performance, high‑fidelity rendering, and instance‑level semantics. Our key insight is that embedding instance‑consistent semantics naturally complements 4D modeling, enabling more accurate scene decomposition while supporting robust dynamic scene understanding. To this end, we leverage temporally aligned instance masks and sentence embeddings derived from Multimodal Large Language Models to supervise the learnable semantic features of each Gaussian via two MLP decoders, enabling language‑aligned 4D representations and enforcing identity consistency over time. To enhance temporal stability, we bridge 2D optical flow with 4D Gaussians and finetune their motions, yielding reliable initialization and reducing drift. For training, we further introduce geometry‑aware SDF constraints, along with regularization terms that enforce surface continuity, enhancing temporal coherence in dynamic foreground modeling. Experiments demonstrate that Director achieves temporally coherent 4D reconstructions while simultaneously enabling instance segmentation and open‑vocabulary querying.

Abstract:
Semantic segmentation of low‑altitude UAV imagery presents unique challenges due to extreme scale variations, complex object boundaries, and limited computational resources on edge devices. Existing transformer‑based segmentation methods achieve remarkable performance but incur high computational overhead, while lightweight approaches struggle to capture fine‑grained details in high‑resolution aerial scenes. To address these limitations, we propose PBSeg, an efficient prototype‑based segmentation framework tailored for UAV applications. PBSeg introduces a novel prototype‑based cross‑attention (PBCA) that exploits feature redundancy to reduce computational complexity while maintaining segmentation quality. The framework incorporates an efficient multi‑scale feature extraction module that combines deformable convolutions (DConv) with context‑aware modulation (CAM) to capture both local details and global semantics. Experiments on two challenging UAV datasets demonstrate the effectiveness of the proposed approach. PBSeg achieves 71.86% mIoU on UAVid and 80.92% mIoU on UDD6, establishing competitive performance while maintaining computational efficiency. Code is available at https://github.com/zhangda1018/PBSeg.

Abstract:
This paper presents a new method for the zero‑shot open‑vocabulary semantic segmentation (OVSS) of 3D automotive lidar data. To circumvent the recognized image‑text modality gap that is intrinsic to approaches based on Vision Language Models (VLMs) such as CLIP, our method relies instead on image generation from text, to create prototype images. Given a 3D network distilled from a 2D Vision Foundation Model (VFM), we then label a point cloud by matching 3D point features with 2D image features of these prototypes. Our method is state‑of‑the‑art for OVSS on nuScenes and SemanticKITTI. Code, pre‑trained models, and generated images are available at https://github.com/valeoai/IGLOSS.
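
The matching step described above (3D point features compared against 2D features of generated prototype images) can be sketched as a nearest-prototype lookup. Cosine similarity, the max-over-prototypes rule, and the names `label_points` and `prototypes` are assumptions for illustration; the abstract does not specify the similarity measure or how multiple prototypes per class are combined.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def label_points(point_features, prototypes):
    """Assign each point the class of its most similar prototype feature.

    prototypes maps a class name to a list of feature vectors, e.g. one
    vector per generated prototype image of that class.
    """
    labels = []
    for f in point_features:
        best_score, best_name = max(
            ((cosine(f, p), name)
             for name, protos in prototypes.items() for p in protos),
            key=lambda t: t[0],
        )
        labels.append(best_name)
    return labels
```

Because the prototypes are image features rather than text embeddings, both sides of the comparison live in the same visual feature space, which is the motivation the abstract gives for avoiding the image-text modality gap.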

Abstract:
This report presents our winning solution to the 5th PVUW MeViS‑Text Challenge. The track studies referring video object segmentation under motion‑centric language expressions, where the model must jointly understand appearance, temporal behavior, and object interactions. To address this problem, we build a fully training‑free pipeline that combines strong multimodal large language models with SAM3. Our method contains three stages. First, Gemini‑3.1 Pro decomposes each target event into instance‑level grounding targets, selects the frame where the target is most clearly visible, and generates a discriminative description. Second, SAM3‑agent produces a precise seed mask on the selected frame, and the official SAM3 tracker propagates the mask through the whole video. Third, a refinement stage uses Qwen3.5‑Plus and behavior‑level verification to correct ambiguous or semantically inconsistent predictions. Without task‑specific fine‑tuning, our method ranks first on the PVUW 2026 MeViS‑Text test set, achieving a Final score of 0.909064 and a J&F score of 0.7897. The code is available at https://github.com/Moujuruo/MeViSv2_Track_Solution_2026.

Abstract:
Semantic segmentation and hyperspectral unmixing are two central problems in spectral image analysis. The former assigns each pixel a discrete label corresponding to its material class, whereas the latter estimates pure material spectra, called endmembers, and, for each pixel, a vector representing material abundances in the observed scene. Despite their complementarity, these two problems are usually addressed independently. This paper aims to bridge these two lines of work by formally showing that, under the linear mixing model, pixel classification by dominant materials induces polyhedral‑cone regions in the spectral space. We leverage this fundamental property to propose a direct segmentation‑to‑unmixing pipeline that performs blind hyperspectral unmixing from any semantic segmentation by constructing a polyhedral‑cone partition of the space that best fits the labeled pixels. Signed distances from pixels to the estimated regions are then computed, linearly transformed via a change of basis in the distance space, and projected onto the probability simplex, yielding an initial abundance estimate. This estimate is used to extract endmembers and recover final abundances via matrix pseudo‑inversion. Because the segmentation method can be freely chosen, the user gains explicit control over the unmixing process, while the rest of the pipeline remains essentially deterministic and lightweight. Beyond improving interpretability, experiments on three real datasets demonstrate the effectiveness of the proposed approach when associated with appropriate clustering algorithms, and show consistent improvements over recent deep and non‑deep state‑of‑the‑art methods. The code is available at: https://github.com/antoine‑bottenmuller/polyhedral‑unmixing
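
One step of the pipeline above, projecting a vector onto the probability simplex so that it can be read as abundances (non-negative and summing to one), has a well-known sort-based solution. The sketch below implements the standard Euclidean projection; that the authors use this particular projection is an assumption, since the abstract does not name the algorithm.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {a : a_i >= 0, sum(a) = 1}.

    Sort-based algorithm: find the largest j such that
    u_j - (sum(u_1..u_j) - 1) / j > 0 on the descending-sorted values u,
    then shift by the corresponding threshold and clip at zero.
    """
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for j, uj in enumerate(u, start=1):
        cumulative += uj
        t = (cumulative - 1.0) / j
        if uj - t > 0:
            theta = t  # the condition holds for a prefix, so the last hit wins
    return [max(x - theta, 0.0) for x in v]
```

For example, a score vector that strongly favours one material, such as `[0.5, 0.5, 2.0]`, projects to a pure-pixel abundance `[0.0, 0.0, 1.0]`, while a vector already on the simplex is left unchanged.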

Abstract:
Training‑free open‑vocabulary remote sensing segmentation (OVRSS), empowered by vision‑language models, has emerged as a promising paradigm for achieving category‑agnostic semantic understanding in remote sensing imagery. Existing approaches mainly focus on enhancing feature representations or mitigating modality discrepancies to improve patch‑level prediction accuracy. However, such independent prediction schemes are fundamentally misaligned with the intrinsic characteristics of remote sensing data. In real‑world applications, remote sensing scenes are typically large‑scale and exhibit strong spatial as well as semantic correlations, making isolated patch‑wise predictions insufficient for accurate segmentation. To address this limitation, we propose ConInfer, a context‑aware inference framework for OVRSS that performs joint prediction across multiple spatial units while explicitly modeling their inter‑unit semantic dependencies. By incorporating global contextual cues, our method significantly enhances segmentation consistency, robustness, and generalization in complex remote sensing environments. Extensive experiments on multiple benchmark datasets demonstrate that our approach consistently surpasses state‑of‑the‑art per‑pixel VLM‑based baselines such as SegEarth‑OV, achieving average improvements of 2.80% and 6.13% on open‑vocabulary semantic segmentation and object extraction tasks, respectively. The implementation code is available at: https://github.com/Dog‑Yang/ConInfer

Abstract:
Robust scene understanding is essential for intelligent vehicles operating in natural, unstructured environments. While semantic segmentation datasets for structured urban driving are abundant, datasets for extremely unstructured wild environments remain scarce due to the difficulty and cost of generating pixel‑accurate annotations. This scarcity hinders the development of perception systems needed for intelligent ground vehicles tasked with forestry automation, agricultural robotics, disaster response, and all‑terrain mobility. To address this gap, we present ForestSim, a high‑fidelity synthetic dataset designed for training and evaluating semantic segmentation models for intelligent vehicles in forested off‑road and no‑road environments. ForestSim contains 2094 photorealistic images across 25 diverse environments, covering multiple seasons, terrain types, and foliage densities. Using Unreal Engine environments integrated with Microsoft AirSim, we generate consistent, pixel‑accurate labels across 20 classes relevant to autonomous navigation. We benchmark ForestSim using state‑of‑the‑art architectures and report strong performance despite the inherent challenges of unstructured scenes. ForestSim provides a scalable and accessible foundation for perception research supporting the next generation of intelligent off‑road vehicles. The dataset and code are publicly available: Dataset: https://vailforestsim.github.io Code: https://github.com/pragatwagle/ForestSim

Abstract:
Referring Video Object Segmentation (RVOS) aims to segment target objects in videos based on natural language descriptions. However, fixed keyframe‑based approaches that couple a vision language model with a separate propagation module often fail to capture rapidly changing spatiotemporal dynamics and to handle queries requiring multi‑step reasoning, leading to sharp performance drops on motion‑intensive and reasoning‑oriented videos beyond static RVOS benchmarks. To address these limitations, we propose VIRST (Video‑Instructed Reasoning Assistant for Spatio‑Temporal Segmentation), an end‑to‑end framework that unifies global video reasoning and pixel‑level mask prediction within a single model. VIRST bridges semantic and segmentation representations through the Spatio‑Temporal Fusion (STF), which fuses segmentation‑aware video features into the vision‑language backbone, and employs the Temporal Dynamic Anchor Updater to maintain temporally adjacent anchor frames that provide stable temporal cues under large motion, occlusion, and reappearance. This unified design achieves state‑of‑the‑art results across diverse RVOS benchmarks under realistic and challenging conditions, demonstrating strong generalization to both referring and reasoning oriented settings. The code and checkpoints are available at https://github.com/AIDASLab/VIRST.

Abstract:
Interactive video segmentation often requires many user interventions for robust performance in challenging scenarios (e.g., occlusions, object separations, camouflage, etc.). Yet, even state‑of‑the‑art models like SAM2 use corrections only for immediate fixes without learning from this feedback, leading to inefficient, repetitive user effort. To address this, we introduce Live Interactive Training (LIT), a novel framework for prompt‑based visual systems where models also learn online from human corrections at inference time. Our primary instantiation, LIT‑LoRA, implements this by continually updating a lightweight LoRA module on‑the‑fly. When a user provides a correction, this module is rapidly trained on that feedback, allowing the vision system to improve performance on subsequent frames of the same video. Leveraging the core principles of LIT, our LIT‑LoRA implementation achieves an average 18‑34% reduction in total corrections on challenging video segmentation benchmarks, with a negligible training overhead of ~0.5s per correction. We further demonstrate its generality by successfully adapting it to other segmentation models and extending it to CLIP‑based fine‑grained image classification. Our work highlights the promise of live adaptation to transform interactive tools and significantly reduce redundant human effort in complex visual tasks. Project: https://youngxinyu1802.github.io/projects/LIT/.

Abstract:
Vision backbone networks play a central role in modern computer vision. Enhancing their efficiency directly benefits a wide range of downstream applications. To measure efficiency, many publications rely on MACs (Multiply Accumulate operations) as a predictor of execution time. In this paper, we experimentally demonstrate the shortcomings of such a metric, especially in the context of edge devices. By contrasting the MAC count and execution time of common architectural design elements, we identify key factors for efficient execution and provide insights to optimize backbone design. Based on these insights, we present LowFormer, a novel vision backbone family. LowFormer features a streamlined macro and micro design that includes Lowtention, a lightweight alternative to Multi‑Head Self‑Attention. Lowtention not only proves more efficient, but also enables superior results on ImageNet. Additionally, we present an edge GPU version of LowFormer, that can further improve upon its baseline's speed on edge GPU and desktop GPU. We demonstrate LowFormer's wide applicability by evaluating it on smaller image classification datasets, as well as adapting it to several downstream tasks, such as object detection, semantic segmentation, image retrieval, and visual object tracking. LowFormer models consistently achieve remarkable speed‑ups across various hardware platforms compared to recent state‑of‑the‑art backbones. Code and models are available at https://github.com/altair199797/LowFormer/blob/main/Beyond_MACs.md.

Abstract:
Recent research on vision backbone architectures has predominantly focused on optimizing efficiency for hardware platforms with high parallel processing capabilities. This category increasingly includes embedded systems such as mobile phones and embedded AI accelerator modules. In contrast, CPUs cannot parallelize operations to the same degree, so models targeting them benefit from a design philosophy that balances the number of operations (MACs) against hardware‑efficient execution, measured as MACs per second (MACpS). In pursuit of this, we investigate two modifications to standard convolutions, aimed at reducing computational cost: grouping convolutions and reducing kernel sizes. While both adaptations substantially decrease the total number of MACs required for inference, sustaining low latency necessitates preserving hardware‑efficiency. Our experiments across diverse CPU devices confirm that these adaptations successfully retain high hardware‑efficiency on CPUs. Based on these insights, we introduce CPUBone, a new family of vision backbone models optimized for CPU‑based inference. CPUBone achieves state‑of‑the‑art Speed‑Accuracy Trade‑offs (SATs) across a wide range of CPU devices and effectively transfers its efficiency to downstream tasks such as object detection and semantic segmentation. Models and code are available at https://github.com/altair199797/CPUBone.
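
To make the two modifications concrete, the sketch below counts MACs for a standard 2D convolution and shows how grouping and a smaller kernel reduce that count. It is a generic textbook formula, not code from CPUBone; the function names, the layer shape (56x56 output, 64 channels), and the reading of MACpS as MACs divided by measured latency are assumptions made for illustration.

```python
def conv_macs(h_out, w_out, c_in, c_out, kernel, groups=1):
    """MACs of a 2D convolution (bias ignored).

    Each output element needs (c_in / groups) * kernel * kernel multiply-accumulates.
    """
    if c_in % groups or c_out % groups:
        raise ValueError("c_in and c_out must be divisible by groups")
    return h_out * w_out * c_out * (c_in // groups) * kernel * kernel


def macs_per_second(macs, latency_s):
    """Hardware efficiency as MACs executed per second (assumed reading of MACpS)."""
    return macs / latency_s


# Hypothetical layer: 56x56 output map, 64 -> 64 channels.
baseline = conv_macs(56, 56, 64, 64, kernel=3)             # 115_605_504 MACs
grouped = conv_macs(56, 56, 64, 64, kernel=3, groups=4)    # 4x fewer MACs
small_kernel = conv_macs(56, 56, 64, 64, kernel=2)         # 9/4 fewer MACs
```

Fewer MACs only translate into lower latency if MACpS stays high, which is why the abstract measures both quantities rather than MACs alone.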

Abstract:
LiDAR‑based semantic segmentation is a key component for autonomous mobile robots, yet large‑scale annotation of LiDAR point clouds is prohibitively expensive and time‑consuming. Although simulators can provide labeled synthetic data, models trained on synthetic data often underperform on real‑world data due to a data‑level domain gap. To address this issue, we propose DRUM, a novel Sim2Real translation framework. We leverage a diffusion model pre‑trained on unlabeled real‑world data as a generative prior and translate synthetic data by reproducing two key measurement characteristics: reflectance intensity and raydrop noise. To improve sample fidelity, we introduce a raydrop‑aware masked guidance mechanism that selectively enforces consistency with the input synthetic data while preserving realistic raydrop noise induced by the diffusion prior. Experimental results demonstrate that DRUM consistently improves Sim2Real performance across multiple representations of LiDAR data. The project page is available at https://miya‑tomoya.github.io/drum.

Abstract:
Accurate and temporally consistent segmentation of the left ventricle from echocardiography videos is essential for estimating the ejection fraction and assessing cardiac function. However, modeling spatiotemporal dynamics remains difficult due to severe speckle noise and rapid non‑rigid deformations. Existing linear recurrent models offer efficient in‑context associative recall for temporal tracking, but rely on unconstrained state updates, which cause progressive singular value decay in the state matrix, a phenomenon known as rank collapse, resulting in anatomical details being overwhelmed by noise. To address this, we propose OSA, a framework that constrains the state evolution on the Stiefel manifold. We introduce the Orthogonalized State Update (OSU) mechanism, which formulates the memory evolution as Euclidean projected gradient descent on the Stiefel manifold to prevent rank collapse and maintain stable temporal transitions. Furthermore, an Anatomical Prior‑aware Feature Enhancement module explicitly separates anatomical structures from speckle noise through a physics‑driven process, providing the temporal tracker with noise‑resilient structural cues. Comprehensive experiments on the CAMUS and EchoNet‑Dynamic datasets show that OSA achieves state‑of‑the‑art segmentation accuracy and temporal stability, while maintaining real‑time inference efficiency for clinical deployment. Codes are available at https://github.com/wangrui2025/OSA.
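
As a reminder of what a projected gradient step on the Stiefel manifold looks like in general, the block below states the standard definitions. The symbols ($S_t$ for the state matrix, $\eta$ for the step size, $\ell_t$ for the per-step objective) are assumptions chosen for illustration; the exact form of the OSU update in the paper may differ.

```latex
% Stiefel manifold: n x p matrices with orthonormal columns
\mathrm{St}(n, p) = \{\, S \in \mathbb{R}^{n \times p} : S^{\top} S = I_p \,\}

% Euclidean projection of a full-rank matrix M with thin SVD M = U \Sigma V^{\top}
\pi(M) = \operatorname*{arg\,min}_{S \in \mathrm{St}(n, p)} \lVert S - M \rVert_F = U V^{\top}

% Projected gradient descent step on the state
S_{t+1} = \pi\bigl( S_t - \eta \, \nabla_S \ell_t(S_t) \bigr)
```

Because $\pi$ sets every singular value of the updated state to 1, the singular values cannot decay over time, which is the sense in which such a constraint rules out rank collapse.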

Abstract:
Vision Foundation Models (VFMs) pre‑trained at scale enable a single frozen encoder to serve multiple downstream tasks simultaneously. Recent VFM‑based encoder‑only models for image and video segmentation, such as EoMT and VidEoMT, achieve competitive accuracy with remarkably low latency, yet they require finetuning the encoder, sacrificing the multi‑task encoder sharing that makes VFMs practically attractive for large‑scale deployment. To reconcile encoder‑only simplicity and speed with frozen VFM features, we propose the Plain Mask Decoder (PMD), a fast Transformer‑based segmentation decoder that operates on top of frozen VFM features. The resulting model, the Plain Mask Transformer (PMT), preserves the architectural simplicity and low latency of encoder‑only designs while keeping the encoder representation unchanged and shareable. The design seamlessly applies to both image and video segmentation, inheriting the generality of the encoder‑only framework. On standard image segmentation benchmarks, PMT matches the frozen‑encoder state of the art while running up to ~3x faster. For video segmentation, it even performs on par with fully finetuned methods, while being up to 8x faster than state‑of‑the‑art frozen‑encoder models. Code: https://github.com/tue‑mps/pmt.

Abstract:
Recently, state space models have demonstrated efficient video segmentation through linear‑complexity state space compression. However, Video Semantic Segmentation (VSS) requires pixel‑level spatiotemporal modeling capabilities to maintain temporal consistency in segmentation of semantic objects. While state space models can preserve common semantic information during state space compression, the fixed‑size state space inevitably forgets specific information, which limits the models' capability for pixel‑level segmentation. To tackle this issue, we propose a Refining Specifics State Space Model approach (RS‑SSM) for video semantic segmentation, which performs complementary refining of forgotten spatiotemporal specifics. Specifically, a Channel‑wise Amplitude Perceptron (CwAP) is designed to extract and align the distribution characteristics of specific information in the state space. In addition, a Forgetting Gate Information Refiner (FGIR) is proposed to adaptively invert and refine the forgetting gate matrix in the state space model based on the specific information distribution. Consequently, our RS‑SSM leverages the inverted forgetting gate to complementarily refine the specific information forgotten during state space compression, thereby enhancing the model's capability for spatiotemporal pixel‑level segmentation. Extensive experiments on four VSS benchmarks demonstrate that our RS‑SSM achieves state‑of‑the‑art performance while maintaining high computational efficiency. The code is available at https://github.com/zhoujiahuan1991/CVPR2026‑RS‑SSM.

Abstract:
Referring Video Object Segmentation (RVOS) aims to segment a target object throughout a video given a natural language query. Training‑free methods for this task follow a common pipeline: an MLLM selects keyframes, grounds the referred object within those frames, and a video segmentation model propagates the results. While intuitive, this design asks the MLLM to make temporal decisions before any object‑level evidence is available, limiting both reasoning quality and spatio‑temporal coverage. To overcome this, we propose AgentRVOS, a training‑free agentic pipeline built on the complementary strengths of SAM3 and an MLLM. Given a concept derived from the query, SAM3 provides reliable perception over the full spatio‑temporal extent through generated mask tracks. The MLLM then identifies the target through query‑grounded reasoning over this object‑level evidence, iteratively pruning guided by SAM3's temporal existence information. Extensive experiments show that AgentRVOS achieves state‑of‑the‑art performance among training‑free methods across multiple benchmarks, with consistent results across diverse MLLM backbones. Our project page is available at: https://cvlab‑kaist.github.io/AgentRVOS/.

Abstract:
A sliding‑window inference strategy is commonly adopted in recent training‑free open‑vocabulary semantic segmentation methods to overcome the limitations of CLIP in processing high‑resolution images. However, this approach introduces a new challenge: each window is processed independently, leading to semantic discrepancy across windows. To address this issue, we propose Global‑Local Aligned CLIP (GLA‑CLIP), a framework that facilitates comprehensive information exchange across windows. Rather than limiting attention to tokens within individual windows, GLA‑CLIP extends key‑value tokens to incorporate contextual cues from all windows. Nevertheless, we observe a window bias: outer‑window tokens are less likely to be attended, since query features are produced through interactions within the inner‑window patches, thereby lacking semantic grounding beyond their local context. To mitigate this, we introduce a proxy anchor, constructed by aggregating tokens highly similar to the given query from all windows, which provides a unified semantic reference for measuring similarity across both inner‑ and outer‑window patches. Furthermore, we propose a dynamic normalization scheme that adjusts attention strength according to object scale by dynamically scaling and thresholding the attention map to cope with small‑object scenarios. Moreover, GLA‑CLIP can be plugged into existing methods to broaden their receptive field. Extensive experiments validate the effectiveness of GLA‑CLIP in enhancing training‑free open‑vocabulary semantic segmentation performance. Code is available at https://github.com/2btlFe/GLA‑CLIP.
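
For readers unfamiliar with the baseline being improved, the sketch below enumerates the crops that a generic sliding-window inference scheme would process independently. It illustrates only the windowing step the abstract criticizes, not GLA‑CLIP itself; the function name and the window and stride values are hypothetical.

```python
def sliding_windows(height, width, window, stride):
    """Return (top, left, bottom, right) boxes that together cover the image."""

    def starts(size):
        # Image smaller than one window: a single (clipped) window suffices.
        if size <= window:
            return [0]
        positions = list(range(0, size - window + 1, stride))
        # Add a final window flush with the border if the stride left a gap.
        if positions[-1] != size - window:
            positions.append(size - window)
        return positions

    return [
        (top, left, min(top + window, height), min(left + window, width))
        for top in starts(height)
        for left in starts(width)
    ]


# Hypothetical setting: 448x672 image, 224-pixel windows, stride 112.
boxes = sliding_windows(448, 672, window=224, stride=112)
```

In a plain sliding-window pipeline, each box is encoded separately and the overlapping predictions are merged afterwards, so tokens in one window never attend to tokens in another. This is the independence the abstract identifies as the source of cross-window discrepancy.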

Abstract:
Attribution maps for semantic segmentation are almost always judged by visual plausibility. Yet looking convincing does not guarantee that the highlighted pixels actually drive the model's prediction, nor that attribution credit stays within the target region. These questions require a dedicated evaluation protocol. We introduce a reproducible benchmark that tests intervention‑based faithfulness, off‑target leakage, perturbation robustness, and runtime on Pascal VOC and SBD across three pretrained backbones. To further demonstrate the benchmark, we propose Dual‑Evidence Attribution (DEA), a lightweight correction that fuses gradient evidence with region‑level intervention signals through agreement‑weighted fusion. DEA increases emphasis where both sources agree and retains causal support when gradient responses are unstable. Across all completed runs, DEA consistently improves deletion‑based faithfulness over gradient‑only baselines and preserves strong robustness, at the cost of additional compute from intervention passes. The benchmark exposes a faithfulness‑stability tradeoff among attribution families that is entirely hidden under visual evaluation, providing a foundation for principled method selection in segmentation explainability. Code is available at https://github.com/anmspro/DEA.
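
The idea behind deletion-based faithfulness can be sketched generically: remove the pixels an attribution map ranks highest and watch how quickly the model's score falls. The code below is a toy version under stated assumptions (a flat list of pixel values, a caller-supplied `score_fn`, zero as the deletion baseline); it is not the benchmark's actual protocol, and all names are hypothetical.

```python
def deletion_curve(score_fn, pixels, attribution, steps, baseline=0.0):
    """Delete the most-attributed pixels first and record the score after each step."""
    order = sorted(range(len(pixels)), key=lambda i: attribution[i], reverse=True)
    current = list(pixels)
    scores = [score_fn(current)]
    per_step = max(1, len(pixels) // steps)
    for start in range(0, len(order), per_step):
        for i in order[start:start + per_step]:
            current[i] = baseline
        scores.append(score_fn(current))
    return scores


def area_under_curve(scores):
    """Trapezoidal area with the x-axis normalized to [0, 1]; lower means more faithful."""
    n = len(scores) - 1
    return sum((scores[k] + scores[k + 1]) / 2 for k in range(n)) / n


# Toy "model": the score is a fixed weighted sum of the pixels.
weights = [0.5, 0.3, 0.2, 0.0]
score = lambda px: sum(w * p for w, p in zip(weights, px))

faithful = area_under_curve(deletion_curve(score, [1, 1, 1, 1], weights, steps=4))
unfaithful = area_under_curve(deletion_curve(score, [1, 1, 1, 1], weights[::-1], steps=4))
```

An attribution map that matches what the model actually uses produces a steep drop and therefore a smaller area than one that ranks irrelevant pixels first.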

Abstract:
The prosperity of Multimodal Large Language Models (MLLMs) has stimulated the demand for video reasoning segmentation, which aims to segment video objects based on human instructions. Previous studies rely on unidirectional and implicit text‑trajectory alignment, which struggles with trajectory perception when faced with severe video dynamics. In this work, we propose TrajSeg, a simple and unified framework built upon MLLMs. Concretely, we introduce bidirectional text‑trajectory alignment, where MLLMs accept grounding‑intended (text‑to‑trajectory) and captioning‑intended (trajectory‑to‑text) instructions. This way, MLLMs can benefit from enhanced correspondence and better perceive object trajectories in videos. The mask generation from trajectories is achieved via a frame‑level content integration (FCI) module and a unified mask decoder. The former adapts the MLLM‑parsed trajectory‑level token to frame‑specific information. The latter unifies segmentation for all frames into a single structure, enabling the proposed framework to be simplified and end‑to‑end trainable. Extensive experiments on referring and reasoning video segmentation datasets demonstrate the effectiveness of TrajSeg, which outperforms all video reasoning segmentation methods on all metrics. The code will be publicly available at https://github.com/haodi19/TrajSeg.

Abstract:
Accurate delineation of individual cells in microscopy videos is essential for studying cellular dynamics, yet separating touching or overlapping instances remains a persistent challenge. Although foundation models for segmentation such as SAM have broadened the accessibility of image segmentation, they still struggle to separate nearby cell instances in dense microscopy scenes without extensive prompting. We propose a prompt‑free, boundary‑aware instance segmentation framework that predicts signed distance functions (SDFs) instead of binary masks, enabling smooth and geometry‑consistent modeling of cell contours. A learned sigmoid mapping converts SDFs into probability maps, yielding sharp boundary localization and robust separation of adjacent instances. Training is guided by a unified Modified Hausdorff Distance (MHD) loss that integrates region‑ and boundary‑based terms. Evaluations on both public and private high‑throughput microscopy datasets demonstrate improved boundary accuracy and instance‑level performance compared to recent SAM‑based and foundation‑model approaches. Source code is available at: https://github.com/ThomasMendelson/BAISeg.git
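
The relationship between a signed distance function and a probability map can be sketched with an analytic shape. The example below uses circles as stand-ins for cells and a sigmoid with fixed `scale` and `shift` in place of the learned mapping. The sign convention (negative inside), the parameter names, and the geometry are all assumptions for illustration rather than details taken from the paper.

```python
import math


def circle_sdf(x, y, cx, cy, radius):
    """Signed distance to a circle: negative inside, zero on the contour, positive outside."""
    return math.hypot(x - cx, y - cy) - radius


def sdf_to_prob(sdf, scale=1.0, shift=0.0):
    """Sigmoid mapping from signed distance to foreground probability.

    `scale` controls boundary sharpness and `shift` moves the 0.5 level set;
    in a trained model both would be learned rather than fixed.
    """
    return 1.0 / (1.0 + math.exp(scale * (sdf - shift)))


# Two hypothetical cells with a one-pixel gap between them.
cell_a = (0.0, 0.0, 5.0)
cell_b = (11.0, 0.0, 5.0)
gap_point = (5.5, 0.0)

p_a = sdf_to_prob(circle_sdf(*gap_point, *cell_a), scale=4.0)
p_b = sdf_to_prob(circle_sdf(*gap_point, *cell_b), scale=4.0)
```

Because the point in the gap has a positive distance to both contours, both probabilities fall below 0.5, so thresholding keeps the two instances apart even though they nearly touch.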

Abstract:
Dense semantic segmentation in dynamic environments is fundamentally limited by the low‑frame‑rate (LFR) nature of standard cameras, which creates critical perceptual gaps between frames. To solve this, we introduce Anytime Interframe Semantic Segmentation: a new task for predicting segmentation at any arbitrary time using only a single past RGB frame and a stream of asynchronous event data. This task presents a core challenge: how to robustly propagate dense semantic features using a motion field derived from sparse and often noisy event data, all while mitigating feature degradation in highly dynamic scenes. We propose LiFR‑Seg, a novel framework that directly addresses these challenges by propagating deep semantic features through time. The core of our method is an uncertainty‑aware warping process, guided by an event‑driven motion field and its learned, explicit confidence. A temporal memory attention module further ensures coherence in dynamic scenarios. We validate our method on the DSEC dataset and a new high‑frequency synthetic benchmark (SHF‑DSEC) we contribute. Remarkably, our LFR system achieves performance (73.82% mIoU on DSEC) that is statistically indistinguishable from an HFR upper‑bound (within 0.09%) that has full access to the target frame. This work presents a new, efficient paradigm for achieving robust, high‑frame‑rate perception with low‑frame‑rate hardware. Project Page: https://candy‑crusher.github.io/LiFR_Seg_Proj/#; Code: https://github.com/Candy‑Crusher/LiFR‑Seg.git.

Abstract:
As one of the most important underwater sensing technologies, forward‑looking sonar exhibits unique imaging characteristics. Sonar images are often affected by severe speckle noise, low texture contrast, acoustic shadows, and geometric distortions. These factors make it difficult for traditional teacher‑student frameworks to achieve satisfactory performance in sonar semantic segmentation tasks under extremely limited labeled data conditions. To address this issue, we propose a Collaborative Teacher Semantic Segmentation Framework for forward‑looking sonar images. This framework introduces a multi‑teacher collaborative mechanism composed of one general teacher and multiple sonar‑specific teachers. By adopting a multi‑teacher alternating guidance strategy, the student model can learn general semantic representations while simultaneously capturing the unique characteristics of sonar images, thereby achieving more comprehensive and robust feature modeling. Considering the challenges of sonar images, which can lead teachers to generate a large number of noisy pseudo‑labels, we further design a cross‑teacher reliability assessment mechanism. This mechanism dynamically quantifies the reliability of pseudo‑labels by evaluating the consistency and stability of predictions across multiple views and multiple teachers, thereby mitigating the negative impact caused by noisy pseudo‑labels. Notably, on the FLSMD dataset, when only 2% of the data is labeled, our method achieves a 5.08% improvement in mIoU compared to other state‑of‑the‑art approaches.
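
One simple way to quantify pseudo-label reliability from several predictors is the fraction of them that agree with the majority label at each pixel. The sketch below uses that agreement ratio as a stand-in for the consistency and stability measure described in the abstract, whose exact form is not given there; the function name and the flat per-pixel label lists are hypothetical.

```python
from collections import Counter


def pseudo_label_with_reliability(predictions):
    """Majority-vote pseudo-labels plus a per-pixel agreement score in (0, 1].

    `predictions` holds one label list per teacher (or per augmented view),
    all of the same length.
    """
    labels, reliability = [], []
    for votes in zip(*predictions):
        label, count = Counter(votes).most_common(1)[0]
        labels.append(label)
        reliability.append(count / len(votes))
    return labels, reliability


# Three hypothetical teachers labeling three pixels.
teacher_preds = [
    [0, 1, 2],
    [0, 1, 1],
    [0, 2, 1],
]
labels, reliability = pseudo_label_with_reliability(teacher_preds)
```

Pixels with low agreement could then be down-weighted or excluded from the student's loss, which is the general purpose of a reliability score in teacher-student training.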

Abstract:
Recent advancements in multimodal large models have significantly bridged the representation gap between diverse modalities, catalyzing the evolution of video multimodal interpretation, which enhances users' understanding of video content by generating correlated modalities. However, most existing video multimodal interpretation methods primarily concentrate on global comprehension with limited user interaction. To address this, we propose a novel task, Controllable Video Segmentation and Captioning (SegCaptioning), which empowers users to provide specific prompts, such as a bounding box around an object of interest, to simultaneously generate correlated masks and captions that precisely embody user intent. We design a framework, the Scene Graph‑guided Fine‑grained SegCaptioning Transformer (SG‑FSCFormer), which integrates a Prompt‑guided Temporal Graph Former to effectively capture and represent user intent through an adaptive prompt adaptor, ensuring that the generated content aligns well with the user's requirements. Furthermore, our model introduces a Fine‑grained Mask‑linguistic Decoder that collaboratively predicts high‑quality caption‑mask pairs using a Multi‑entity Contrastive loss and provides fine‑grained alignment between each mask and its corresponding caption tokens, thereby enhancing users' comprehension of videos. Comprehensive experiments conducted on two benchmark datasets demonstrate that SG‑FSCFormer achieves remarkable performance, effectively capturing user intent and generating precise multimodal outputs tailored to user specifications. Our code is available at https://github.com/XuZhang1211/SG‑FSCFormer.

Abstract:
Cloud occlusion severely degrades the semantic integrity of optical remote sensing imagery. While incorporating Synthetic Aperture Radar (SAR) provides complementary observations, achieving efficient global modeling and reliable cross‑modal fusion under cloud interference remains challenging. Existing methods rely on dense global attention to capture long‑range dependencies, yet such aggregation indiscriminately propagates cloud‑induced noise. Improving robustness typically entails enlarging model capacity, which further increases computational overhead. Given the large‑scale and high‑resolution nature of remote sensing applications, such computational demands hinder practical deployment, leading to an efficiency‑reliability trade‑off. To address this dilemma, we propose EDC, an efficiency‑oriented and discrepancy‑conditioned optical‑SAR semantic segmentation framework. A tri‑stream encoder with Carrier Tokens enables compact global context modeling with reduced complexity. To prevent noise contamination, we introduce a Discrepancy‑Conditioned Hybrid Fusion (DCHF) mechanism that selectively suppresses unreliable regions during global aggregation. In addition, an auxiliary cloud removal branch with teacher‑guided distillation enhances semantic consistency under occlusion. Extensive experiments demonstrate that EDC achieves superior accuracy and efficiency, improving mIoU by 0.56% and 0.88% on M3M‑CR and WHU‑OPT‑SAR, respectively, while reducing the number of parameters by 46.7% and accelerating inference by 1.98×. Our implementation is available at https://github.com/mengcx0209/EDC.

Abstract:
State Space Models (SSMs), especially the recent Mamba architecture, have achieved remarkable success in sequence modeling tasks. However, extending SSMs to computer vision remains challenging due to the non‑sequential structure of visual data and its complex 2D spatial dependencies. Although several early studies have explored adapting selective SSMs for vision applications, most approaches primarily depend on employing various traversal strategies over the same input. This introduces redundancy and distorts the intricate spatial relationships within images. To address these challenges, we propose MFil‑Mamba, a novel visual state space architecture built on a multi‑filter scanning backbone. Unlike fixed multi‑directional traversal methods, our design enables each scan to capture unique and contextually relevant spatial information while minimizing redundancy. Furthermore, in addition to architectural enhancements, we incorporate an adaptive weighting mechanism to effectively fuse outputs from multiple scans. MFil‑Mamba achieves superior performance over existing state‑of‑the‑art models across various benchmarks that include image classification, object detection, instance segmentation, and semantic segmentation. For example, our tiny variant attains 83.2% top‑1 accuracy on ImageNet‑1K, 47.3% box AP and 42.7% mask AP on MS COCO, and 48.5% mIoU on the ADE20K dataset. Code and models are available at https://github.com/puskal‑khadka/MFil‑Mamba.

Abstract:
Few‑shot 3D semantic segmentation aims to generate accurate semantic masks for query point clouds with only a few annotated support examples. Existing prototype‑based methods typically construct compact and deterministic prototypes from the support set to guide query segmentation. However, such rigid representations are unable to capture the intrinsic uncertainty introduced by scarce supervision, which often results in degraded robustness and limited generalization. In this work, we propose UPL (Uncertainty‑aware Prototype Learning), a probabilistic approach designed to incorporate uncertainty modeling into prototype learning for few‑shot 3D segmentation. Our framework introduces two key components. First, UPL introduces a dual‑stream prototype refinement module that enriches prototype representations by jointly leveraging limited information from both support and query samples. Second, we formulate prototype learning as a variational inference problem, regarding class prototypes as latent variables. This probabilistic formulation enables explicit uncertainty modeling, providing robust and interpretable mask predictions. Extensive experiments on the widely used ScanNet and S3DIS benchmarks show that our UPL achieves consistent state‑of‑the‑art performance under different settings while providing reliable uncertainty estimation. The code is available at https://fdueblab‑upl.github.io/.

Abstract:
We present LoD‑Loc v3, a novel method for generalized aerial visual localization in dense urban environments. While prior work LoD‑Loc v2 achieves localization through semantic building silhouette alignment with low‑detail city models, it suffers from two key limitations: poor cross‑scene generalization and frequent failure in dense building scenes. Our method addresses these challenges through two key innovations. First, we develop a new synthetic data generation pipeline that produces InsLoD‑Loc, the largest instance segmentation dataset for aerial imagery to date, comprising 100k images with precise instance building annotations. This enables trained models to exhibit remarkable zero‑shot generalization capability. Second, we reformulate the localization paradigm by shifting from semantic to instance silhouette alignment, which significantly reduces pose estimation ambiguity in dense scenes. Extensive experiments demonstrate that LoD‑Loc v3 outperforms existing state‑of‑the‑art (SOTA) baselines by a large margin in both cross‑scene and dense urban scenarios. The project is available at https://nudt‑sawlab.github.io/LoD‑Locv3/.

Abstract:
Open‑world semantic segmentation presently relies significantly on extensive image‑text pair datasets, which often suffer from a lack of fine‑grained pixel annotations on sufficient categories. The acquisition of such data is rendered economically prohibitive due to the substantial investments of both human labor and time. In light of the formidable image generation capabilities of diffusion models, we introduce a novel diffusion model‑driven pipeline for automatically generating datasets tailored to the needs of open‑world semantic segmentation, named "MagicSeg". Our MagicSeg initiates from class labels and proceeds to generate high‑fidelity textual descriptions, which in turn serve as guidance for the diffusion model to generate images. Rather than only generating positive samples for each label, our process encompasses the simultaneous generation of corresponding negative images, designed to serve as paired counterfactual samples for contrastive training. Then, to provide a self‑supervised signal for open‑world segmentation pretraining, our MagicSeg integrates an open‑vocabulary detection model and an interactive segmentation model to extract object masks as precise segmentation labels from images based on the provided category labels. By applying our dataset to the contrastive language‑image pretraining model with the pseudo mask supervision and the auxiliary counterfactual contrastive training, the downstream model obtains strong performance on open‑world semantic segmentation. We evaluate our model on PASCAL VOC, PASCAL Context, and COCO, achieving SOTA with performance of 62.9%, 26.7%, and 40.2%, respectively, demonstrating our dataset's effectiveness in enhancing open‑world semantic segmentation capabilities. Project website: https://github.com/ckxhp/magicseg.

Abstract:
With the growing adoption of vision‑language‑action models and world models in autonomous driving systems, scalable image tokenization becomes crucial as the interface for the visual modality. However, most existing tokenizers are designed for monocular and 2D scenes, leading to inefficiency and inter‑view inconsistency when applied to high‑resolution multi‑view driving scenes. To address this, we propose DriveTok, an efficient 3D driving scene tokenizer for unified multi‑view reconstruction and understanding. DriveTok first obtains semantically rich visual features from vision foundation models and then transforms them into the scene tokens with 3D deformable cross‑attention. For decoding, we employ a multi‑view transformer to reconstruct multi‑view features from the scene tokens and use multiple heads to obtain RGB, depth, and semantic reconstructions. We also add a 3D head directly on the scene tokens for 3D semantic occupancy prediction for better spatial awareness. With the multiple training objectives, DriveTok learns unified scene tokens that integrate semantic, geometric, and textural information for efficient multi‑view tokenization. Extensive experiments on the widely used nuScenes dataset demonstrate that the scene tokens from DriveTok perform well on image reconstruction, semantic segmentation, depth prediction, and 3D occupancy prediction tasks.

Abstract:
Bird's‑Eye‑View (BEV) perception serves as a cornerstone for autonomous driving, offering a unified spatial representation that fuses surrounding‑view images to enable reasoning for various downstream tasks, such as semantic segmentation, 3D object detection, and motion prediction. However, most existing BEV perception frameworks adopt an end‑to‑end training paradigm, where image features are directly transformed into the BEV space and optimized solely through downstream task supervision. This formulation treats the entire perception process as a black box, often lacking explicit 3D geometric understanding and interpretability, leading to suboptimal performance. In this paper, we claim that an explicit 3D representation matters for accurate BEV perception, and we propose Splat2BEV, a Gaussian Splatting‑assisted framework for BEV tasks. Splat2BEV aims to learn BEV feature representations that are both semantically rich and geometrically precise. We first pre‑train a Gaussian generator that explicitly reconstructs 3D scenes from multi‑view inputs, enabling the generation of geometry‑aligned feature representations. These representations are then projected into the BEV space to serve as inputs for downstream tasks. Extensive experiments on the nuScenes and Argoverse datasets demonstrate that Splat2BEV achieves state‑of‑the‑art performance and validate the effectiveness of incorporating explicit 3D reconstruction into BEV perception.

Abstract:
Deploying high‑performance dense prediction models on resource‑constrained edge devices remains challenging due to strict limits on computation and memory. In practice, lightweight systems for object detection, instance segmentation, and pose estimation are still dominated by CNN‑based architectures such as YOLO, while compact Vision Transformers (ViTs) often struggle to achieve a similarly strong accuracy‑efficiency trade‑off, even with large‑scale pretraining. We argue that this gap is largely due to insufficient task‑specific representation learning in small‑scale ViTs, rather than an inherent mismatch between ViTs and edge dense prediction. To address this issue, we introduce EdgeCrafter, a unified compact ViT framework for edge dense prediction centered on ECDet, a detection model built from a distilled compact backbone and an edge‑friendly encoder‑decoder design. On the COCO dataset, ECDet‑S achieves 51.7 AP with fewer than 10M parameters using only COCO annotations. For instance segmentation, ECInsSeg achieves performance comparable to RF‑DETR while using substantially fewer parameters. For pose estimation, ECPose‑X reaches 74.8 AP, significantly outperforming YOLO26Pose‑X (71.6 AP). These results show that compact ViTs, when paired with task‑specialized distillation and edge‑aware design, can be a practical and competitive option for edge dense prediction. Code is available at: https://intellindust‑ai‑lab.github.io/projects/EdgeCrafter/

Abstract:
Standard deep learning models for image segmentation cannot guarantee topology accuracy, failing to preserve the correct number of connected components or structures. This, in turn, affects the quality of the segmentations and compromises the reliability of the subsequent quantification analyses. Previous works have proposed to enhance topology accuracy with specialized frameworks, architectures, and loss functions. However, these methods are often cumbersome to integrate into existing training pipelines, they are computationally very expensive, or they are restricted to structures with tubular morphology. We present SCNP, an efficient method that improves topology accuracy by penalizing the logits with their poorest‑classified neighbor, forcing the model to improve the prediction at the pixels' neighbors before allowing it to improve the pixels themselves. We show the effectiveness of SCNP across 13 datasets, covering different structure morphologies and image modalities, and integrate it into three frameworks for semantic and instance segmentation. Additionally, we show that SCNP can be integrated into several loss functions, making them improve topology accuracy. Our code can be found at https://jmlipman.github.io/SCNP‑SameClassNeighborPenalization.
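
The neighbor‑penalization idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a binary task with one foreground logit per pixel, a square window of hypothetical size `radius`, and that "poorest‑classified" means the lowest logit for foreground pixels and the highest logit for background pixels. The function name `penalize_with_worst_neighbor` is made up for illustration.

```python
def penalize_with_worst_neighbor(logits, labels, radius=1):
    """Replace each logit by the worst logit among its same-class neighbors.

    logits: 2D list of floats, one foreground logit per pixel.
    labels: 2D list of 0/1 ground-truth labels.
    Assumption: for label 1 the worst neighbor has the lowest logit,
    for label 0 it has the highest logit.
    """
    height, width = len(logits), len(logits[0])
    penalized = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            # Collect logits of neighbors (including the pixel itself)
            # that share the ground-truth class of pixel (r, c).
            same_class = [
                logits[rr][cc]
                for rr in range(max(0, r - radius), min(height, r + radius + 1))
                for cc in range(max(0, c - radius), min(width, c + radius + 1))
                if labels[rr][cc] == labels[r][c]
            ]
            if labels[r][c] == 1:
                penalized[r][c] = min(same_class)
            else:
                penalized[r][c] = max(same_class)
    return penalized
```

Under these assumptions, a standard loss such as cross‑entropy would be computed on the penalized logits instead of the raw ones, so a pixel's loss cannot decrease until its worst same‑class neighbor improves.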

Abstract:
Collecting and annotating datasets for pixel‑level semantic segmentation tasks is highly labor‑intensive. Data augmentation provides a viable solution by enhancing model generalization without additional real‑world data collection. Traditional augmentation techniques, such as translation, scaling, and color transformations, create geometric variations but fail to generate new structures. While generative models have been employed to extend the semantic information of datasets, they often struggle to maintain consistency between the original and generated images, particularly for pixel‑level tasks. In this work, we propose a novel synthetic data augmentation pipeline that integrates controllable diffusion models. Our approach balances the diversity and reliability of the generated data, effectively bridging the gap between synthetic and real data. We utilize class‑aware prompting and visual prior blending to improve image quality further, ensuring precise alignment with segmentation labels. By evaluating on benchmark datasets such as PASCAL VOC and BDD100K, we demonstrate that our method significantly enhances semantic segmentation performance, especially in data‑scarce scenarios, while improving model robustness in real‑world applications. Our code is available at https://github.com/chequanghuy/Enhanced‑Generative‑Data‑Augmentation‑for‑Semantic‑Segmentation‑via‑Stronger‑Guidance.

Abstract:
Semantic segmentation for uncrewed aerial vehicles (UAVs) is fundamental for aerial scene understanding, yet existing RGB and RGB‑T datasets remain limited in scale, diversity, and annotation efficiency due to the high cost of manual labeling and the difficulties of accurate RGB‑T alignment on off‑the‑shelf UAVs. To address these challenges, we propose a scalable geometry‑driven 2D‑3D‑2D paradigm that leverages multi‑view redundancy in high‑overlap aerial imagery to automatically propagate labels from a small subset of manually annotated RGB images to both RGB and thermal modalities within a unified framework. By lifting less than 3% of RGB images into a semantic 3D point cloud and reprojecting it into all views, our approach enables dense pseudo ground‑truth generation across large image collections, automatically producing 97% of RGB labels and 100% of thermal labels while achieving 91% and 88% annotation accuracy without any 2D manual refinement. We further extend this 2D‑3D‑2D paradigm to cross‑modal image registration, using 3D geometry as an intermediate alignment space to obtain fully automatic, strong pixel‑level RGB‑T alignment with 87% registration accuracy and no hardware‑level synchronization. Applying our framework to existing geo‑referenced aerial imagery, we construct SegFly, a large‑scale benchmark with over 20,000 high‑resolution RGB images and more than 15,000 geometrically aligned RGB‑T pairs spanning diverse urban, industrial, and rural environments across multiple altitudes and seasons. On SegFly, we establish the Firefly baseline for RGB and thermal semantic segmentation and show that both conventional architectures and vision foundation models benefit substantially from SegFly supervision, highlighting the potential of geometry‑driven 2D‑3D‑2D pipelines for scalable multi‑modal scene understanding. Data and Code available at https://github.com/markus‑42/SegFly.
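
The reprojection step of such a 2D‑3D‑2D pipeline can be sketched as follows. This is only an illustration under stated assumptions, not the SegFly code: it assumes an ideal pinhole camera without lens distortion, a world‑to‑camera pose given as a rotation matrix and a translation vector, and a simple per‑pixel depth test for occlusion. The function name `project_labels` and all parameter names are hypothetical.

```python
def project_labels(points, labels, intrinsics, rotation, translation, width, height):
    """Render a sparse label map by projecting labeled 3D points into one view.

    points: list of (x, y, z) world coordinates; labels: list of int class ids.
    intrinsics: (fx, fy, cx, cy); rotation: 3x3 world-to-camera matrix (row lists);
    translation: world-to-camera translation. Unlabeled pixels are set to -1.
    """
    fx, fy, cx, cy = intrinsics
    label_map = [[-1] * width for _ in range(height)]
    depth = [[float("inf")] * width for _ in range(height)]
    for (x, y, z), label in zip(points, labels):
        # World -> camera frame: p_cam = R * p_world + t.
        xc, yc, zc = (
            row[0] * x + row[1] * y + row[2] * z + offset
            for row, offset in zip(rotation, translation)
        )
        if zc <= 0:
            continue  # point is behind the camera
        # Pinhole projection to pixel coordinates.
        u = int(round(fx * xc / zc + cx))
        v = int(round(fy * yc / zc + cy))
        # Keep only the closest point per pixel (simple occlusion handling).
        if 0 <= u < width and 0 <= v < height and zc < depth[v][u]:
            depth[v][u] = zc
            label_map[v][u] = label
    return label_map
```

Calling this once per image with that image's pose yields a pseudo label map for every view that observes the labeled point cloud; a real pipeline would additionally densify the sparse result.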

Abstract:
Cell segmentation is a fundamental task in microscopy image analysis. Several foundation models for cell segmentation have been introduced; virtually all of them are extensions of the Segment Anything Model (SAM) that improve it for microscopy data. Recently, SAM2 and SAM3 have been published, further improving and extending the capabilities of general‑purpose segmentation foundation models. Here, we comprehensively evaluate foundation models for cell segmentation (CellPoseSAM, CellSAM, μSAM) and for general‑purpose segmentation (SAM, SAM2, SAM3) on a diverse set of (light) microscopy datasets, for tasks including cell, nucleus and organoid segmentation. Furthermore, we introduce a new instance segmentation strategy called automatic prompt generation (APG) that can be used to further improve SAM‑based microscopy foundation models. APG consistently improves segmentation results for μSAM, which is used as the base model, and is competitive with the state‑of‑the‑art model CellPoseSAM. Moreover, our work provides important lessons for adaptation strategies of SAM‑style models to microscopy and provides a strategy for creating even more powerful microscopy foundation models. Our code is publicly available at https://github.com/computational‑cell‑analytics/micro‑sam.

Abstract:
Multimodal remote sensing semantic segmentation enhances scene interpretation by exploiting complementary physical cues from heterogeneous data. Although pretrained Vision Foundation Models (VFMs) provide strong general‑purpose representations, adapting them to multimodal tasks often incurs substantial computational overhead and is prone to modality imbalance, where the contribution of auxiliary modalities is suppressed during optimization. To address these challenges, we propose MoBaNet, a parameter‑efficient and modality‑balanced symmetric fusion framework. Built upon a largely frozen VFM backbone, MoBaNet adopts a symmetric dual‑stream architecture to preserve generalizable representations while minimizing the number of trainable parameters. Specifically, we design a Cross‑modal Prompt‑Injected Adapter (CPIA) to enable deep semantic interaction by generating shared prompts and injecting them into bottleneck adapters under the frozen backbone. To obtain compact and discriminative multimodal representations for decoding, we further introduce a Difference‑Guided Gated Fusion Module (DGFM), which adaptively fuses paired stage features by explicitly leveraging cross‑modal discrepancy to guide feature selection. Furthermore, we propose a Modality‑Conditional Random Masking (MCRM) strategy to mitigate modality imbalance by masking one modality only during training and imposing hard‑pixel auxiliary supervision on modality‑specific branches. Extensive experiments on the ISPRS Vaihingen and Potsdam benchmarks demonstrate that MoBaNet achieves state‑of‑the‑art performance with significantly fewer trainable parameters than full fine‑tuning, validating its effectiveness for robust and balanced multimodal fusion. The source code in this work is available at https://github.com/sauryeo/MoBaNet.

Abstract:
Autonomous landing of uncrewed aerial vehicles (UAVs) in unknown, dynamic environments poses significant safety challenges, particularly near people and infrastructure, as UAVs transition to routine urban and rural operations. Existing methods often rely on prior maps, heavy sensors like LiDAR, static markers, or fail to handle non‑cooperative dynamic obstacles like humans, limiting generalization and real‑time performance. To address these challenges, we introduce SafeLand, a lean, vision‑based system for safe autonomous landing (SAL) that requires no prior information and operates only with a camera and a lightweight height sensor. Our approach constructs an online semantic ground map via deep learning‑based semantic segmentation, optimized for embedded deployment and trained on a consolidation of seven curated public aerial datasets (achieving 70.22% mIoU across 20 classes), which is further refined through Bayesian probabilistic filtering with temporal semantic decay to robustly identify metric‑scale landing spots. A behavior tree then governs adaptive landing, iteratively validates the spot, and reacts in real time to dynamic obstacles by pausing, climbing, or rerouting to alternative spots, maximizing human safety. We extensively evaluate our method in 200 simulations and 60 end‑to‑end field tests across industrial, urban, and rural environments at altitudes up to 100m, demonstrating zero false negatives for human detection. Compared to the state of the art, SafeLand achieves sub‑second response latency, substantially lower than previous methods, while maintaining a superior success rate of 95%. To facilitate further research in aerial robotics, we release SafeLand's segmentation model as a plug‑and‑play ROS package, available at https://github.com/markus‑42/SafeLand.
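
One common way to realize Bayesian filtering with temporal decay on a grid map is a per‑cell log‑odds update, sketched below. This is an assumption‑laden illustration rather than SafeLand's actual filter: the multiplicative `decay` factor, the clamping constant, and the function names are all hypothetical.

```python
import math


def update_cell(log_odds, p_safe, decay=0.9):
    """One log-odds update of a map cell's 'safe to land' belief.

    log_odds: current belief of the cell (0.0 means probability 0.5).
    p_safe: per-frame probability that the cell is safe, e.g. from segmentation.
    decay: assumed factor in (0, 1] that shrinks old evidence toward the prior,
           so stale observations gradually lose influence.
    """
    p = min(max(p_safe, 1e-6), 1.0 - 1e-6)  # avoid log(0)
    return decay * log_odds + math.log(p / (1.0 - p))


def probability(log_odds):
    """Convert log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# Example: three 'safe' observations, then a person walks into the cell.
belief = 0.0
for p in (0.8, 0.8, 0.8):
    belief = update_cell(belief, p)
for p in (0.1, 0.1):
    belief = update_cell(belief, p)
```

With these illustrative numbers the belief first rises well above 0.9 and then falls below 0.5 after two contrary observations, which is the kind of quick reaction to dynamic obstacles that decayed evidence is meant to allow.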

Abstract:
Reliable terrain perception is a fundamental requirement for autonomous navigation in unstructured, off‑road environments. Desert landscapes present unique challenges due to low chromatic contrast between terrain categories, extreme lighting variability, and sparse vegetation that defy the assumptions of standard road‑scene segmentation models. We present DesertFormer, a semantic segmentation pipeline for off‑road desert terrain analysis based on SegFormer B2 with a hierarchical Mix Transformer (MiT‑B2) backbone. The system classifies terrain into ten ecologically meaningful categories ‑‑ Trees, Lush Bushes, Dry Grass, Dry Bushes, Ground Clutter, Flowers, Logs, Rocks, Landscape, and Sky ‑‑ enabling safety‑aware path planning for ground robots and autonomous vehicles. Trained on a purpose‑built dataset of 4,176 annotated off‑road images at 512x512 resolution, DesertFormer achieves a mean Intersection‑over‑Union (mIoU) of 64.4% and pixel accuracy of 86.1%, representing a +24.2% absolute improvement over a DeepLabV3 MobileNetV2 baseline (41.0% mIoU). We further contribute a systematic failure analysis identifying the primary confusion patterns ‑‑ Ground Clutter to Landscape and Dry Grass to Landscape ‑‑ and propose class‑weighted training and copy‑paste augmentation for rare terrain categories. Code, checkpoints, and an interactive inference dashboard are released at https://github.com/Yasaswini‑ch/Vision‑based‑Desert‑Terrain‑Segmentation‑using‑SegFormer.
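
For reference, the two reported metrics can be computed from a confusion matrix as in the following sketch. The helper name `segmentation_metrics` and the toy matrix are illustrative only and are not taken from the DesertFormer repository.

```python
def segmentation_metrics(confusion):
    """Return (mIoU, pixel accuracy) from a square confusion matrix.

    confusion[i][j] is the number of pixels with true class i predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    ious = []
    for i in range(n):
        tp = confusion[i][i]
        fn = sum(confusion[i]) - tp
        fp = sum(confusion[j][i] for j in range(n)) - tp
        union = tp + fp + fn
        if union > 0:  # skip classes absent from both prediction and ground truth
            ious.append(tp / union)
    return sum(ious) / len(ious), correct / total


# Toy two-class example.
miou, accuracy = segmentation_metrics([[8, 2], [1, 9]])
```

Because mIoU averages over classes while pixel accuracy averages over pixels, rare classes with poor IoU pull mIoU down much more than they affect pixel accuracy, which is consistent with the gap between the two figures reported above.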

Abstract:
Cross‑domain panoramic semantic segmentation has attracted growing interest as it enables comprehensive 360° scene understanding for real‑world applications. However, it remains particularly challenging due to severe geometric Field of View (FoV) distortions and inconsistent open‑set semantics across domains. In this work, we formulate an open‑set domain adaptation setting and propose the Extrapolative Domain Adaptive Panoramic Segmentation (EDA‑PSeg) framework, which trains on local perspective views and tests on full 360° panoramic images, explicitly tackling both geometric FoV shifts across domains and semantic uncertainty arising from previously unseen classes. To this end, we propose the Euler‑Margin Attention (EMA), which introduces an angular margin to enhance viewpoint‑invariant semantic representation, while performing amplitude and phase modulation to improve generalization toward unseen classes. Additionally, we design the Graph Matching Adapter (GMA), which builds high‑order graph relations to align shared semantics across FoV shifts while effectively separating novel categories through structural adaptation. Extensive experiments on four benchmark datasets under camera‑shift, weather‑condition, and open‑set scenarios demonstrate that EDA‑PSeg achieves state‑of‑the‑art performance, robust generalization to diverse viewing geometries, and resilience under varying environmental conditions. The code is available at https://github.com/zyfone/EDA‑PSeg.

Abstract:
Multimodal semantic segmentation with heterogeneous sensors must reconcile complementary information across modalities that differ in spatial resolution and channel dimensionality. In particular, high‑resolution RGB imaging provides detailed spatial structure but often fails to distinguish visually similar materials, whereas hyperspectral imaging (HSI) provides discriminative spectral signatures but at lower spatial resolution. We present Bidirectional Cross‑Attention Fusion (BCAF), which aligns high‑resolution RGB with low‑resolution HSI at their native grids via localized, bidirectional cross‑attention, avoiding pre‑upsampling or early spectral collapse. BCAF uses two independent backbones: a standard Swin Transformer for RGB and an HSI‑adapted Swin backbone that preserves spectral structure through 3D tokenization with spectral self‑attention. Although our evaluation targets RGB‑HSI fusion, BCAF is modality‑agnostic and applies to co‑registered RGB with lower‑resolution, high‑channel auxiliary sensors. On the benchmark SpectralWaste dataset, BCAF delivers strong performance, achieving 75.4% at 55 images/s. We further evaluate a novel industrial dataset: K3I‑Cycling (first RGB subset already released on Fordatis). On this dataset, BCAF reaches 62.3% mIoU for material segmentation (paper, metal, plastic, etc.) and 66.2% mIoU for plastic‑type segmentation (PET, PP, HDPE, LDPE, PS, etc.). These results show that preserving native‑grid spatial detail and spectral structure improves multimodal segmentation under real‑time constraints. Code and model checkpoints are publicly available at https://github.com/jonasvilhofunk/BCAF_2026.

Abstract:
Promptable instance segmentation is widely adopted in embodied and AR systems, yet the performance of foundation models trained on perspective imagery often degrades on 360° panoramas. In this paper, we introduce Segment Any 4K Panorama (SAP), a foundation model for 4K high‑resolution panoramic instance‑level segmentation. We reformulate panoramic segmentation as fixed‑trajectory perspective video segmentation, decomposing a panorama into overlapping perspective patches sampled along a continuous spherical traversal. This memory‑aligned reformulation preserves native 4K resolution while restoring the smooth viewpoint transitions required for stable cross‑view propagation. To enable large‑scale supervision, we synthesize 183,440 4K‑resolution panoramic images with instance segmentation labels using the InfiniGen engine. Trained under this trajectory‑aligned paradigm, SAP generalizes effectively to real‑world 360° images, achieving a +17.2 zero‑shot mIoU gain over vanilla SAM2 of different sizes on a real‑world 4K panorama benchmark.

Abstract:
Accurate visual fault detection in freight trains remains a critical challenge for intelligent transportation system maintenance, due to complex operational environments, structurally repetitive components, and frequent occlusions or contaminations in safety‑critical regions. Conventional instance segmentation methods based on convolutional neural networks and Transformers often suffer from poor generalization and limited boundary accuracy under such conditions. To address these challenges, we propose a lightweight self‑prompted instance segmentation framework tailored for freight train fault detection. Our method leverages the Segment Anything Model by introducing a self‑prompt generation module that automatically produces task‑specific prompts, enabling effective knowledge transfer from foundation models to domain‑specific inspection tasks. In addition, we adopt a Tiny Vision Transformer backbone to reduce computational cost, making the framework suitable for real‑time deployment on edge devices in railway monitoring systems. We construct a domain‑specific dataset collected from real‑world freight inspection stations and conduct extensive evaluations. Experimental results show that our method achieves 74.6 AP^box and 74.2 AP^mask on the dataset, outperforming existing state‑of‑the‑art methods in both accuracy and robustness while maintaining low computational overhead. This work offers a deployable and efficient vision solution for automated freight train inspection, demonstrating the potential of foundation model adaptation in industrial‑scale fault diagnosis scenarios. Project page: https://github.com/MVME‑HBUT/SAM_FTI‑FDet.git

Abstract:
Autonomous space operations such as on‑orbit servicing and active debris removal demand robust part‑level semantic understanding and precise relative navigation of target spacecraft, yet collecting large‑scale real data in orbit remains impractical due to cost and access constraints. Existing synthetic datasets, moreover, suffer from limited target diversity, single‑modality sensing, and incomplete ground‑truth annotations. We present SpaceSense‑Bench, a large‑scale multi‑modal benchmark for spacecraft perception encompassing 136 satellite models with approximately 70 GB of data. Each frame provides time‑synchronized 1024×1024 RGB images, millimeter‑precision depth maps, and 256‑beam LiDAR point clouds, together with dense 7‑class part‑level semantic labels at both the pixel and point level as well as accurate 6‑DoF pose ground truth. The dataset is generated through a high‑fidelity space simulation built in Unreal Engine 5 and a fully automated pipeline covering data acquisition, multi‑stage quality control, and conversion to mainstream formats. We benchmark five representative tasks (object detection, 2D semantic segmentation, RGB‑LiDAR fusion‑based 3D point cloud segmentation, monocular depth estimation, and orientation estimation) and identify two key findings: (i) perceiving small‑scale components (e.g., thrusters and omni‑antennas) and generalizing to entirely unseen spacecraft in a zero‑shot setting remain critical bottlenecks for current methods, and (ii) scaling up the number of training satellites yields substantial performance gains on novel targets, underscoring the value of large‑scale, diverse datasets for space perception research. The dataset, code, and toolkit are publicly available at https://github.com/wuaodi/SpaceSense‑Bench.

Abstract:
RGB‑Thermal (RGB‑T) semantic segmentation is essential for robotic systems operating in low‑light or dark environments. However, traditional approaches often overemphasize modality balance, resulting in limited robustness and severe performance degradation when sensor signals are partially missing. Recent advances such as cross‑modal knowledge distillation and modality‑adaptive fine‑tuning attempt to enhance cross‑modal interaction, but they typically decouple modality fusion and modality adaptation, requiring multi‑stage training with frozen models or teacher‑student frameworks. We present RTFDNet, a three‑branch encoder‑decoder that unifies fusion and decoupling for robust RGB‑T segmentation. Synergistic Feature Fusion (SFF) performs channel‑wise gated exchange and lightweight spatial attention to inject complementary cues. Cross‑Modal Decouple Regularization (CMDR) isolates modality‑specific components from the fused representation and supervises unimodal decoders via stop‑gradient targets. Region Decouple Regularization (RDR) enforces class‑selective prediction consistency in confident regions while blocking gradients to the fusion branch. This feedback loop strengthens unimodal paths without degrading the fused stream, enabling efficient standalone inference at test time. Extensive experiments demonstrate the effectiveness of RTFDNet, showing consistent performance across varying modality conditions. To facilitate further research, our source code is publicly available at https://github.com/curapima/RTFDNet.

Abstract:
Rotation equivariance constitutes one of the most general and crucial structural priors for visual data, yet it remains notably absent from current Mamba‑based vision architectures. Despite the success of Mamba in natural language processing and its growing adoption in computer vision, existing visual Mamba models fail to account for rotational symmetry in their design. This omission renders them inherently sensitive to image rotations, thereby constraining their robustness and cross‑task generalization. To address this limitation, we incorporate rotation symmetry, a universal and fundamental geometric prior in images, into Mamba‑based architectures. Specifically, we introduce EQ‑VMamba, the first rotation equivariant visual Mamba architecture for vision tasks. The core components of EQ‑VMamba include a carefully designed rotation equivariant cross‑scan strategy and group Mamba blocks. Moreover, we provide a rigorous theoretical analysis of the intrinsic equivariance error, demonstrating that the proposed architecture enforces end‑to‑end rotation equivariance throughout the network. Extensive experiments across multiple benchmarks ‑‑ including high‑level image classification, mid‑level semantic segmentation, and low‑level image super‑resolution ‑‑ demonstrate that EQ‑VMamba consistently improves rotation robustness and achieves superior or competitive performance compared to non‑equivariant baselines, while requiring approximately 50% fewer parameters. These results indicate that embedding rotation equivariance not only effectively bolsters the robustness of visual Mamba models against rotation transformations, but also enhances overall performance with significantly improved parameter efficiency. Code is available at https://github.com/zhongchenzhao/EQ‑VMamba.

Abstract:
Pretraining and fine‑tuning have emerged as a new paradigm in remote sensing image interpretation. Among pretraining approaches, Masked Autoencoder (MAE)‑based pretraining stands out for its strong capability to learn general feature representations via reconstructing masked image regions. However, applying MAE to multispectral remote sensing images remains challenging due to complex backgrounds, indistinct targets, and the lack of semantic guidance during masking, which hinders the learning of underlying structures and meaningful spatial‑spectral features. To address this, we propose a simple yet effective approach, Spectral Index‑Guided MAE (SIGMAE), for multispectral image pretraining. The core idea is to incorporate domain‑specific spectral indices as prior knowledge to guide dynamic token masking toward informative regions. SIGMAE introduces Semantic Saliency‑Guided Dynamic Token Masking (SSDTM), a curriculum‑style strategy that quantifies each patch's semantic richness and internal heterogeneity to adaptively select the most informative tokens during training. By prioritizing semantically salient regions and progressively increasing sample difficulty, SSDTM enhances spectrally rich and structurally aware representation learning, mitigates overfitting, and reduces redundant computation compared with random masking. Extensive experiments on five widely used datasets covering various downstream tasks, including scene classification, semantic segmentation, object extraction and change detection, demonstrate that SIGMAE outperforms other pretrained geospatial foundation models. Moreover, it exhibits strong spatial‑spectral reconstruction capability, even with a 90% mask ratio, and improves complex target recognition under limited labeled data. The source code and model weights will be released at https://github.com/zxk688/SIGMAE.

Abstract:
As powerful generative models, text‑to‑image diffusion models have recently been explored for discriminative tasks. A line of research focuses on adapting a pre‑trained diffusion model to semantic segmentation without any further training, leading to training‑free diffusion segmentors. These methods typically rely on cross‑attention maps from the model's attention layers, which are assumed to capture semantic relationships between image pixels and text tokens. Ideally, such approaches should benefit from more powerful diffusion models, i.e., stronger generative capability should lead to better segmentation. However, we observe that existing methods often fail to scale accordingly. To understand this issue, we identify two underlying gaps: (i) cross‑attention is computed across multiple heads and layers, but there exists a discrepancy between these individual attention maps and a unified global representation; and (ii) even when a global map is available, it does not directly translate to accurate semantic correlation for segmentation, due to score imbalances among different text tokens. To bridge these gaps, we propose two techniques, auto aggregation and per‑pixel rescaling, which together enable training‑free segmentation to better leverage generative capability. We evaluate our approach on standard semantic segmentation benchmarks and further integrate it into a generative technique, demonstrating both improved performance and broad applicability. Code is available at https://github.com/Darkbblue/goca.

Abstract:
Garment manipulation has attracted increasing attention due to its critical role in home‑assistant robotics. However, the majority of existing garment manipulation works assume an initial state consisting of only one garment, while piled garments are far more common in real‑world settings. To bridge this gap, we propose a novel garment retrieval pipeline that can not only follow language instructions to execute safe and clean retrieval but also guarantee that exactly one garment is retrieved per attempt, establishing a robust foundation for the execution of downstream tasks (e.g., folding, hanging, wearing). Our pipeline seamlessly integrates vision‑language reasoning with visual affordance perception, fully leveraging the high‑level reasoning and planning capabilities of VLMs alongside the generalization power of visual affordance for low‑level actions. To enhance the VLM's comprehensive awareness of each garment's state within a garment pile, we employ a visual segmentation model (SAM2) to segment the objects in the garment pile, providing VLM‑based reasoning with sufficient visual cues. A mask fine‑tuning mechanism is further integrated to address scenarios where the initial segmentation results are suboptimal. In addition, a dual‑arm cooperation framework is deployed to address cases involving large or long garments, as well as excessive garment sagging caused by incorrect grasping point determination, both of which are strenuous for a single arm to handle. The effectiveness of our pipeline is consistently demonstrated across diverse tasks and varying scenarios in both real‑world and simulation environments. Project page: https://garmentpile2.github.io/.

Abstract:
Multimodal Image Fusion (MMIF) integrates complementary information from various modalities to produce clearer and more informative fused images. MMIF under adverse weather is particularly crucial in autonomous driving and UAV monitoring applications. However, existing adverse weather fusion methods generally tackle only a single type of degradation, such as haze, rain, or snow, and fail when multiple degradations coexist (e.g., haze+rain, rain+snow). To address this challenge, we propose Compound Adverse Weather Mamba (CAWM‑Mamba), the first end‑to‑end framework that jointly performs image fusion and compound weather restoration with unified shared weights. Our network contains three key components: (1) a Weather‑Aware Preprocess Module (WAPM) to enhance degraded visible features and extract global weather embeddings; (2) a Cross‑modal Feature Interaction Module (CFIM) to facilitate the alignment of heterogeneous modalities and the exchange of complementary features across modalities; and (3) a Wavelet Space State Block (WSSB) that leverages wavelet‑domain decomposition to decouple multi‑frequency degradations. WSSB includes Freq‑SSM, a module that models anisotropic high‑frequency degradation without redundancy, and a unified degradation representation mechanism to further improve generalization across complex compound weather conditions. Extensive experiments on the AWMM‑100K benchmark and three standard fusion datasets demonstrate that CAWM‑Mamba consistently outperforms state‑of‑the‑art methods in both compound and single‑weather scenarios. In addition, our fusion results excel in downstream tasks covering semantic segmentation and object detection, confirming the practical value in real‑world adverse weather perception. The source code will be available at https://github.com/Feecuin/CAWM‑Mamba.

Abstract:
Knowledge distillation (KD) has been widely applied in semantic segmentation to compress large models, but conventional approaches primarily preserve in‑domain accuracy while neglecting out‑of‑domain generalization, which is essential under distribution shifts. This limitation becomes more severe with the emergence of vision foundation models (VFMs): although VFMs exhibit strong robustness on unseen data, distilling them with conventional KD often compromises this ability. We propose Generalizable Knowledge Distillation (GKD), a multi‑stage framework that explicitly enhances generalization. GKD decouples representation learning from task learning. In the first stage, the student acquires domain‑agnostic representations through selective feature distillation, and in the second stage, these representations are frozen for task adaptation, thereby mitigating overfitting to visible domains. To further support transfer, we introduce a query‑based soft distillation mechanism, where student features act as queries to teacher representations to selectively retrieve transferable spatial knowledge from VFMs. Extensive experiments on five domain generalization benchmarks demonstrate that GKD consistently outperforms existing KD methods, achieving average gains of +1.9% in foundation‑to‑foundation (F2F) and +10.6% in foundation‑to‑local (F2L) distillation. The code will be available at https://github.com/Younger‑hua/GKD.
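
The query‑based retrieval idea can be sketched with plain dot‑product attention. This is a hedged illustration, not the GKD implementation: it assumes student and teacher tokens already share the same feature dimension (for example after a projection layer), uses the teacher tokens as both keys and values, and pairs the retrieval with a mean‑squared‑error loss chosen only for the example. The names `retrieve_teacher_features`, `distillation_loss`, and `temperature` are hypothetical.

```python
import math


def retrieve_teacher_features(student, teacher, temperature=1.0):
    """For each student token (query), return a softmax-weighted mix of teacher tokens.

    student: list of feature vectors (queries).
    teacher: list of feature vectors used as both keys and values (assumption).
    """
    dim = len(teacher[0])
    retrieved = []
    for query in student:
        scores = [sum(q * k for q, k in zip(query, key)) / temperature for key in teacher]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        norm = sum(weights)
        weights = [w / norm for w in weights]
        retrieved.append(
            [sum(w * value[d] for w, value in zip(weights, teacher)) for d in range(dim)]
        )
    return retrieved


def distillation_loss(student, teacher, temperature=1.0):
    """Mean squared error between student tokens and their retrieved teacher targets."""
    targets = retrieve_teacher_features(student, teacher, temperature)
    squared = [
        (s - t) ** 2
        for s_row, t_row in zip(student, targets)
        for s, t in zip(s_row, t_row)
    ]
    return sum(squared) / len(squared)
```

Because each student token selects its own mixture of teacher tokens, the target is soft and spatially flexible rather than a fixed position‑by‑position match; a lower `temperature` makes the selection sharper.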

Abstract:
In real underwater environments, downstream image recognition tasks such as semantic segmentation and object detection often face challenges posed by problems like blurring and color inconsistencies. Underwater image enhancement (UIE) has emerged as a promising preprocessing approach, aiming to improve the recognizability of targets in underwater images. However, most existing UIE methods mainly focus on enhancing images for human visual perception, frequently failing to reconstruct high‑frequency details that are critical for task‑specific recognition. To address this issue, we propose a Downstream Task‑Inspired Underwater Image Enhancement (DTI‑UIE) framework, which leverages a human visual perception model to enhance images effectively for underwater vision tasks. Specifically, we design an efficient two‑branch network with a task‑aware attention module for feature mixing. The network benefits from a multi‑stage training framework and a task‑driven perceptual loss. Additionally, inspired by human perception, we automatically construct a Task‑Inspired UIE Dataset (TI‑UIED) using various task‑specific networks. Experimental results demonstrate that DTI‑UIE significantly improves task performance by generating preprocessed images that are beneficial for downstream tasks such as semantic segmentation, object detection, and instance segmentation. The code is publicly available at https://github.com/oucailab/DTIUIE.

Abstract:
In recent years, various methods have been proposed for mesh analysis, each offering distinct advantages and often excelling on different object classes. We present a novel Mixture of Experts (MoE) framework designed to harness the complementary strengths of these diverse approaches. We propose a new gate architecture that encourages each expert to specialise in the classes it excels in. Our design is guided by two key ideas: (1) random walks over the mesh surface effectively capture the regions that individual experts attend to, and (2) an attention mechanism enables the gate to focus on the areas most informative for each expert's decision‑making. To further enhance performance, we introduce a dynamic loss balancing scheme that adjusts the trade‑off between diversity and similarity losses throughout training, where the diversity loss promotes expert specialization and the similarity loss enables knowledge sharing among the experts. Our framework achieves state‑of‑the‑art results in mesh classification, retrieval, and semantic segmentation tasks. Our code is available at: https://github.com/amirbelder/MME‑Mixture‑of‑Mesh‑Experts.

Abstract:
High‑quality pixel‑level annotations are essential for the semantic segmentation of remote sensing imagery. However, such labels are expensive to obtain and often affected by noise due to the labor‑intensive and time‑consuming nature of pixel‑wise annotation, which makes it challenging for human annotators to label every pixel accurately. Annotation errors can significantly degrade the performance and robustness of modern segmentation models, motivating the need for reliable mechanisms to identify and quantify noisy training samples. This paper introduces a novel data‑centric benchmark, together with a new publicly available dataset and two techniques for identifying, quantifying, and ranking training samples according to their level of label noise in remote sensing semantic segmentation. The proposed methods leverage complementary strategies based on model uncertainty, prediction consistency, and representation analysis, and consistently outperform established baselines across a range of experimental settings. The outcomes of this work are publicly available at https://github.com/keillernogueira/label_noise_segmentation.

Abstract:
Transforming image features from perspective view (PV) space to bird's‑eye‑view (BEV) space remains challenging in autonomous driving due to depth ambiguity and occlusion. Although several view transformation (VT) paradigms have been proposed, the challenge still remains. In this paper, we propose a new regularization framework, dubbed CycleBEV, that enhances existing VT models for BEV semantic segmentation. Inspired by cycle consistency, widely used in image distribution modeling, we devise an inverse view transformation (IVT) network that maps BEV segmentation maps back to PV segmentation maps and use it to regularize VT networks during training through cycle consistency losses, enabling them to capture richer semantic and geometric information from input PV images. To further exploit the capacity of the IVT network, we introduce two novel ideas that extend cycle consistency into geometric and representation spaces. We evaluate CycleBEV on four representative VT models covering three major paradigms using the large‑scale nuScenes dataset. Experimental results show consistent improvements ‑‑ with gains of up to 0.74, 4.86, and 3.74 mIoU for drivable area, vehicle, and pedestrian classes, respectively ‑‑ without increasing inference complexity, since the IVT network is used only during training. The implementation code is available at https://github.com/JeongbinHong/CycleBEV.

Abstract:
Transformer‑based real‑time object detectors achieve strong accuracy‑latency trade‑offs, and D‑FINE is among the top‑performing recent architectures. However, real‑time instance segmentation with transformers is still less common. We present D‑FINE‑seg, an instance segmentation extension of D‑FINE that adds a lightweight mask head and segmentation‑aware training, including box‑cropped BCE and dice mask losses, auxiliary and denoising mask supervision, and an adapted Hungarian matching cost. On the TACO dataset, D‑FINE‑seg improves F1‑score over Ultralytics YOLO26 under a unified TensorRT FP16 end‑to‑end benchmarking protocol, while maintaining competitive latency. A second contribution is an end‑to‑end pipeline for training, exporting, and optimized inference across ONNX, TensorRT, and OpenVINO for both object detection and instance segmentation tasks. This framework is released as open‑source under the Apache‑2.0 license. GitHub repository: https://github.com/ArgoHA/D‑FINE‑seg.

Abstract:
This paper presents a new unified approach to semantic segmentation in both images and videos by using language modeling to output the masks as sequences of discrete tokens. We use run length encoding (RLE) to discretize the segmentation masks, and adapt the Pix2Seq framework to learn autoregressive models to output these tokens. We propose novel tokenization strategies to compress the lengths of the token sequences to make it practicable to extend this approach to videos. We also show how instance information can be incorporated into the tokenization process to perform panoptic segmentation. We evaluate our models on two domain‑specific datasets to demonstrate their competitiveness with the state of the art in certain scenarios, in spite of being severely bottlenecked by our limited computational resources. We supplement these analyses by proposing several promising approaches to foster future competitiveness in general‑purpose applications, and facilitate this by making our code and models publicly available.
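As a rough illustration of the run length encoding step mentioned above, the sketch below flattens a binary mask in row‑major order and stores each foreground run as a (start, length) pair, which is the kind of discrete sequence a language model could be trained to emit. The function names `rle_encode` and `rle_decode` and the (start, length) layout are assumptions made for illustration; the abstract does not specify the actual tokenization.

```python
def rle_encode(mask):
    """Encode a 2D binary mask (list of rows of 0/1) as (start, length) runs.

    Pixels are flattened in row-major order; this layout is an assumption
    made for illustration, not the tokenization used in the paper.
    """
    flat = [v for row in mask for v in row]
    runs = []
    start = None
    for i, v in enumerate(flat):
        if v and start is None:
            start = i  # a foreground run begins here
        elif not v and start is not None:
            runs.append((start, i - start))  # the run ended at i - 1
            start = None
    if start is not None:
        runs.append((start, len(flat) - start))  # run reaches the last pixel
    return runs


def rle_decode(runs, height, width):
    """Rebuild the 2D binary mask from (start, length) runs."""
    flat = [0] * (height * width)
    for start, length in runs:
        for i in range(start, start + length):
            flat[i] = 1
    return [flat[r * width:(r + 1) * width] for r in range(height)]
```

The sequence length grows with the number of runs rather than the number of pixels, which is why shortening the token sequence matters when extending the idea to videos.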

Abstract:
Foundation models are transforming Earth Observation (EO), yet the diversity of EO sensors and modalities makes a single universal model unrealistic. Multiple specialized EO foundation models (EOFMs) will likely coexist, making efficient knowledge transfer across modalities essential. Most existing EO pretraining relies on masked image modeling, which emphasizes local reconstruction but provides limited control over global semantic structure. To address this, we propose a dual‑teacher contrastive distillation framework for multispectral imagery that aligns the student's pretraining objective with the contrastive self‑distillation paradigm of modern optical vision foundation models (VFMs). Our approach combines a multispectral teacher with an optical VFM teacher, enabling coherent cross‑modal representation learning. Experiments across diverse optical and multispectral benchmarks show that our model adapts to multispectral data without compromising performance on optical‑only inputs, achieving state‑of‑the‑art results in both settings, with an average improvement of 3.64 percentage points in semantic segmentation, 1.2 in change detection, and 1.31 in classification tasks. This demonstrates that contrastive distillation provides a principled and efficient approach to scalable representation learning across heterogeneous EO data sources. Project page: https://wolfilip.github.io/DEO/.

Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated strong performance on the UI‑to‑code task, which aims to generate UI code from design mock‑ups. However, when applied to long and complex websites, they often struggle with fragmented segmentation, redundant code generation for repetitive components, and frequent UI inconsistencies. To systematically investigate and address these challenges, we introduce ComUIBench, a new multi‑page complex webpage benchmark with component annotations, designed to evaluate MLLMs' ability to generate reusable UI code in realistic website scenarios. Building upon this benchmark, we propose ComUICoder, a component‑based UI code generation framework that emphasizes semantic‑aware segmentation, code reuse, and fine‑grained refinement. Specifically, ComUICoder incorporates (1) Hybrid Semantic‑aware Block Segmentation for accurate detection of semantically coherent UI blocks, (2) Visual‑aware Graph‑based Block Merge to consolidate structurally similar components within and across webpages for reusable implementation, and (3) Priority‑based Element‑wise Feedback to refine generated code and reduce element‑level inconsistencies. Extensive experiments demonstrate that ComUICoder significantly improves overall generation quality and code reusability on complex multi‑page websites. Our datasets and code are publicly available at https://github.com/WebPAI/ComUICoder.

Abstract:
Recent advances in Vision Language Models (VLMs) and Vision Foundation Models (VFMs) have opened new opportunities for zero‑shot text‑guided segmentation of remote sensing imagery. However, most existing approaches still rely on additional trainable components, limiting their generalisation and practical applicability. In this work, we investigate to what extent text‑based remote sensing segmentation can be achieved without additional training, by relying solely on existing foundation models. We propose a simple yet effective approach that integrates contrastive and generative VLMs with the Segment Anything Model (SAM), enabling a fully training‑free or lightweight LoRA‑tuned pipeline. Our contrastive approach employs CLIP as a mask selector for SAM's grid‑based proposals, achieving state‑of‑the‑art open‑vocabulary semantic segmentation (OVSS) in a completely zero‑shot setting. In parallel, our generative approach enables reasoning and referring segmentation by generating click prompts for SAM using GPT‑5 in a zero‑shot setting and a LoRA‑tuned Qwen‑VL model, with the latter yielding the best results. Extensive experiments across 19 remote sensing benchmarks, including open‑vocabulary, referring, and reasoning‑based tasks, demonstrate the strong capabilities of our approach. Code will be released at https://github.com/josesosajs/trainfree‑rs‑segmentation.

Abstract:
Domain‑generalized LiDAR semantic segmentation (LSS) seeks to train models on source‑domain point clouds that generalize reliably to multiple unseen target domains, which is essential for real‑world LiDAR applications. However, existing approaches assume similar acquisition views (e.g., vehicle‑mounted) and struggle in cross‑view scenarios, where observations differ substantially due to viewpoint‑dependent structural incompleteness and non‑uniform point density. Accordingly, we formulate cross‑view domain generalization for LiDAR semantic segmentation and propose a novel framework, termed CVGC (Cross‑View Geometric Consistency). Specifically, we introduce a cross‑view geometric augmentation module that models viewpoint‑induced variations in visibility and sampling density, generating multiple cross‑view observations of the same scene. Subsequently, a geometric consistency module enforces consistent semantic and occupancy predictions across geometrically augmented point clouds of the same scene. Extensive experiments on six public LiDAR datasets establish the first systematic evaluation of cross‑view domain generalization for LiDAR semantic segmentation, demonstrating that CVGC consistently outperforms state‑of‑the‑art methods when generalizing from a single source domain to multiple target domains with heterogeneous acquisition viewpoints. The source code will be publicly available at https://github.com/KintomZi/CVGC‑DG

Abstract:
In real‑world scenarios, the performance of semantic segmentation often deteriorates when processing low‑quality (LQ) images, which may lack clear semantic structures and high‑frequency details. Although image restoration techniques offer a promising direction for enhancing degraded visual content, conventional real‑world image restoration (Real‑IR) models primarily focus on pixel‑level fidelity and often fail to recover task‑relevant semantic cues, limiting their effectiveness when directly applied to downstream vision tasks. Conversely, existing segmentation models trained on high‑quality data lack robustness under real‑world degradations. In this paper, we propose Restoration Adaptation for Semantic Segmentation (RASS), which effectively integrates semantic image restoration into the segmentation process, enabling high‑quality semantic segmentation on the LQ images directly. Specifically, we first propose a Semantic‑Constrained Restoration (SCR) model, which injects segmentation priors into the restoration model by aligning its cross‑attention maps with segmentation masks, encouraging semantically faithful image reconstruction. Then, RASS transfers semantic restoration knowledge into segmentation through LoRA‑based module merging and task‑specific fine‑tuning, thereby enhancing the model's robustness to LQ images. To validate the effectiveness of our framework, we construct a real‑world LQ image segmentation dataset with high‑quality annotations, and conduct extensive experiments on both synthetic and real‑world LQ benchmarks. The results show that SCR and RASS significantly outperform state‑of‑the‑art methods in segmentation and restoration tasks. Code, models, and datasets will be available at https://github.com/Ka1Guan/RASS.git.

Abstract:
Unsupervised image segmentation is a critical task in computer vision. It enables dense scene understanding without human annotations, which is especially valuable in domains where labelled data is scarce. However, existing methods often struggle to reconcile global semantic structure with fine‑grained boundary accuracy. This paper introduces DynaGuide, an adaptive segmentation framework that addresses these challenges through a novel dual‑guidance strategy and dynamic loss optimization. Building on our previous work, DynaSeg, DynaGuide combines global pseudo‑labels from zero‑shot models such as DiffSeg or SegFormer with local boundary refinement using a lightweight CNN trained from scratch. This synergy allows the model to correct coarse or noisy global predictions and produce high‑precision segmentations. At the heart of DynaGuide is a multi‑component loss that dynamically balances feature similarity, Huber‑smoothed spatial continuity, including diagonal relationships, and semantic alignment with the global pseudo‑labels. Unlike prior approaches, DynaGuide trains entirely without ground‑truth labels in the target domain and supports plug‑and‑play integration of diverse guidance sources. Extensive experiments on BSD500, PASCAL VOC2012, and COCO demonstrate that DynaGuide achieves state‑of‑the‑art performance, improving mIoU by 17.5% on BSD500, 3.1% on PASCAL VOC2012, and 11.66% on COCO. With its modular design, strong generalization, and minimal computational footprint, DynaGuide offers a scalable and practical solution for unsupervised segmentation in real‑world settings. Code available at: https://github.com/RyersonMultimediaLab/DynaGuide
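To make the "Huber‑smoothed spatial continuity, including diagonal relationships" term more concrete, the following sketch computes a mean Huber penalty over differences between each pixel and its right, lower, and two diagonal neighbours. The function names, the scalar response map, the equal weighting of all four directions, and the default `delta` are assumptions for illustration; they are not taken from the DynaGuide code.

```python
def huber(x, delta=1.0):
    """Huber penalty: quadratic for |x| <= delta, linear beyond it."""
    ax = abs(x)
    if ax <= delta:
        return 0.5 * x * x
    return delta * (ax - 0.5 * delta)


def continuity_loss(resp, delta=1.0):
    """Mean Huber penalty over right, down, and both diagonal neighbour pairs.

    `resp` is a 2D list of scalar responses (an assumed stand-in for one
    channel of a network output).
    """
    h, w = len(resp), len(resp[0])
    # Each unordered neighbour pair is visited exactly once.
    offsets = [(0, 1), (1, 0), (1, 1), (1, -1)]
    total, count = 0.0, 0
    for r in range(h):
        for c in range(w):
            for dr, dc in offsets:
                rr, cc = r + dr, c + dc
                if 0 <= rr < h and 0 <= cc < w:
                    total += huber(resp[r][c] - resp[rr][cc], delta)
                    count += 1
    return total / count if count else 0.0
```

Compared with a plain squared difference, the linear tail of the Huber penalty punishes genuine object boundaries less harshly while still smoothing small fluctuations.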

Abstract:
Vision‑language segmentation models such as SAM3 enable flexible, prompt‑driven visual grounding, but inherit large, general‑purpose text encoders originally designed for open‑ended language understanding. In practice, segmentation prompts are short, structured, and semantically constrained, leading to substantial over‑provisioning in text encoder capacity and persistent computational and memory overhead. In this paper, we perform a large‑scale anatomical analysis of text prompting in vision‑language segmentation, covering 404,796 real prompts across multiple benchmarks. Our analysis reveals severe redundancy: most context windows are underutilized, vocabulary usage is highly sparse, and text embeddings lie on a low‑dimensional manifold despite high‑dimensional representations. Motivated by these findings, we propose SAM3‑LiteText, a lightweight text encoding framework that replaces the original SAM3 text encoder with a compact MobileCLIP student that is optimized by knowledge distillation. Extensive experiments on image and video segmentation benchmarks show that SAM3‑LiteText reduces text encoder parameters by up to 88%, substantially reducing static memory footprint, while maintaining segmentation performance comparable to the original model. Code: https://github.com/SimonZeng7108/efficientsam3/tree/sam3_litetext.

Abstract:
Reliable 3D instance segmentation is fundamental to language‑grounded robotic manipulation. Its critical application lies in cluttered environments, where occlusions, limited viewpoints, and noisy masks degrade perception. To address these challenges, we present Clutt3R‑Seg, a zero‑shot pipeline for robust 3D instance segmentation for language‑grounded grasping in cluttered scenes. Our key idea is to introduce a hierarchical instance tree of semantic cues. Unlike prior approaches that attempt to refine noisy masks, our method leverages them as informative cues: through cross‑view grouping and conditional substitution, the tree suppresses over‑ and under‑segmentation, yielding view‑consistent masks and robust 3D instances. Each instance is enriched with open‑vocabulary semantic embeddings, enabling accurate target selection from natural language instructions. To handle scene changes during multi‑stage tasks, we further introduce a consistency‑aware update that preserves instance correspondences from only a single post‑interaction image, allowing efficient adaptation without rescanning. Clutt3R‑Seg is evaluated on both synthetic and real‑world datasets, and validated on a real robot. Across all settings, it consistently outperforms state‑of‑the‑art baselines in cluttered and sparse‑view scenarios. Even on the most challenging heavy‑clutter sequences, Clutt3R‑Seg achieves an AP@25 of 61.66, over 2.2x higher than baselines, and with only four input views it surpasses MaskClustering with eight views by more than 2x. The code is available at: https://github.com/jeonghonoh/clutt3r‑seg.

Abstract:
Query‑based 3D scene instance segmentation from point clouds has attained notable performance. However, existing methods suffer from the query initialization dilemma due to the sparse nature of point clouds and rely on computationally intensive attention mechanisms in query decoders. We accordingly introduce LaSSM, prioritizing simplicity and efficiency while maintaining competitive performance. Specifically, we propose a hierarchical semantic‑spatial query initializer to derive the query set from superpoints by considering both semantic cues and spatial distribution, achieving comprehensive scene coverage and accelerated convergence. We further present a coordinate‑guided state space model (SSM) decoder that progressively refines queries. The novel decoder features a local aggregation scheme that restricts the model to focus on geometrically coherent regions and a spatial dual‑path SSM block to capture underlying dependencies within the query set by integrating associated coordinate information. Our design enables efficient instance prediction, avoiding the incorporation of noisy information and reducing redundant computation. LaSSM ranks first on the latest ScanNet++ V2 leaderboard, outperforming the previous best method by 2.5% mAP with only 1/3 FLOPs, demonstrating its superiority in challenging large‑scale scene instance segmentation. LaSSM also achieves competitive performance on ScanNet, ScanNet200, S3DIS and ScanNet++ V1 benchmarks at lower computational cost. Extensive ablation studies and qualitative results validate the effectiveness of our design. The code and weights are available at https://github.com/RayYoh/LaSSM.

Abstract:
Self‑driving cars hold significant potential to reduce traffic accidents, alleviate congestion, and enhance urban mobility. However, developing reliable AI systems for autonomous vehicles remains a substantial challenge. Over the past decade, multi‑task learning has emerged as a powerful approach to address complex problems in driving perception. Multi‑task networks offer several advantages, including increased computational efficiency, real‑time processing capabilities, optimized resource utilization, and improved generalization. In this study, we present AurigaNet, an advanced multi‑task network architecture designed to push the boundaries of autonomous driving perception. AurigaNet integrates three critical tasks: object detection, lane detection, and drivable area instance segmentation. The system is trained and evaluated using the BDD100K dataset, renowned for its diversity in driving conditions. Key innovations of AurigaNet include its end‑to‑end instance segmentation capability, which significantly enhances both accuracy and efficiency in path estimation for autonomous vehicles. Experimental results demonstrate that AurigaNet achieves an 85.2% IoU in drivable area segmentation, outperforming its closest competitor by 0.7%. In lane detection, AurigaNet achieves a remarkable 60.8% IoU, surpassing other models by more than 30%. Furthermore, the network achieves an mAP@0.5:0.95 of 47.6% in traffic object detection, exceeding the next leading model by 2.9%. Additionally, we validate the practical feasibility of AurigaNet by deploying it on embedded devices such as the Jetson Orin NX, where it demonstrates competitive real‑time performance. These results underscore AurigaNet's potential as a robust and efficient solution for autonomous driving perception systems. The code can be found here https://github.com/KiaRational/AurigaNet.

Abstract:
Synthetic data provide low‑cost, accurately annotated samples for geometry‑sensitive vision tasks, but appearance and imaging differences between synthetic and real domains cause severe domain shift and degrade downstream performance. Unpaired synthetic‑to‑real translation can reduce this gap without paired supervision, yet existing methods often face a trade‑off between photorealism and structural stability: unconstrained generation may introduce deformation or spurious textures, while overly rigid constraints limit adaptation to real‑domain statistics. We propose FD‑DB, a frequency‑decoupled dual‑branch model that separates appearance transfer into low‑frequency interpretable editing and high‑frequency residual compensation. The interpretable branch predicts physically meaningful editing parameters (white balance, exposure, contrast, saturation, blur, and grain) to build a stable low‑frequency appearance base with strong content preservation. The free branch complements fine details through residual generation, and a gated fusion mechanism combines the two branches under explicit frequency constraints to limit low‑frequency drift. We further adopt a two‑stage training schedule that first stabilizes the editing branch and then releases the residual branch to improve optimization stability. Experiments on the YCB‑V dataset show that FD‑DB improves real‑domain appearance consistency and significantly boosts downstream semantic segmentation performance while preserving geometric and semantic structures.

Abstract:
Segment Anything Model 2 (SAM2) shows excellent performance in video object segmentation tasks; however, its heavy computational burden hinders its application in real‑time video processing. Although there have been efforts to improve the efficiency of SAM2, most of them focus on retraining a lightweight backbone, with little exploration of post‑training acceleration. In this paper, we observe that SAM2 exhibits a sparse perception pattern similar to biological vision, which provides opportunities for eliminating redundant computation and accelerating inference: i) In the mask decoder, attention primarily focuses on the foreground objects, whereas the image encoder in the earlier stage exhibits a broad attention span, which results in unnecessary computation on background regions. ii) In the memory bank, only a small subset of tokens in each frame contributes significantly to memory attention, and the salient regions exhibit temporal consistency, making full‑token computation redundant. With these insights, we propose Efficient‑SAM2, which enables SAM2 to adaptively focus on object regions while eliminating task‑irrelevant computations, thereby significantly improving inference efficiency. Specifically, for the image encoder, we propose object‑aware Sparse Window Routing (SWR), a window‑level computation allocation mechanism that leverages the consistency and saliency cues from the previous‑frame decoder to route background regions into a lightweight shortcut branch. Moreover, for memory attention, we propose object‑aware Sparse Memory Retrieval (SMR), which allows only the salient memory tokens in each frame to participate in computation, with the saliency pattern reused from their first recollection. With negligible additional parameters and minimal training overhead, Efficient‑SAM2 delivers a 1.68x speedup on the SAM2.1‑L model with only a 1.0% accuracy drop on the SA‑V test set.

Abstract:
Recent self‑supervised Vision Transformers (ViTs), such as DINOv3, provide rich feature representations for dense vision tasks. This study investigates the intrinsic few‑shot semantic segmentation (FSS) capabilities of frozen DINOv3 features through a training‑free baseline, FSSDINO, utilizing class‑specific prototypes and Gram‑matrix refinement. Our results across binary, multi‑class, and cross‑domain (CDFSS) benchmarks demonstrate that this minimal approach, applied to the final backbone layer, is highly competitive with specialized methods involving complex decoders or test‑time adaptation. Crucially, we conduct an Oracle‑guided layer analysis, identifying a significant performance gap between the standard last‑layer features and globally optimal intermediate representations. We reveal a "Safest vs. Optimal" dilemma: while the Oracle proves higher performance is attainable, matching the results of compute‑intensive adaptation methods, current unsupervised and support‑guided selection metrics consistently yield lower performance than the last‑layer baseline. This characterizes a "Semantic Selection Gap" in Foundation Models, a disconnect where traditional heuristics fail to reliably identify high‑fidelity features. Our work establishes the "Last‑Layer" as a deceptively strong baseline and provides a rigorous diagnostic of the latent semantic potentials in DINOv3. The code is publicly available at https://github.com/hussni0997/fssdino.
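A training‑free baseline built on class‑specific prototypes can be sketched in a few lines: average the support features under each class mask to obtain a prototype, then label every query feature by its most similar prototype. The sketch below uses plain Python lists in place of frozen backbone features; the function names are hypothetical, cosine similarity is an assumed choice of metric, and the Gram‑matrix refinement mentioned in the abstract is omitted.

```python
import math


def masked_average(features, mask):
    """Average the feature vectors whose mask entry is 1 (a class prototype)."""
    selected = [f for f, m in zip(features, mask) if m]
    if not selected:
        raise ValueError("mask selects no features")
    dim = len(selected[0])
    return [sum(f[d] for f in selected) / len(selected) for d in range(dim)]


def cosine(a, b):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-12)


def segment_by_prototypes(query_features, prototypes):
    """Label each query feature with the index of its most similar prototype."""
    labels = []
    for f in query_features:
        sims = [cosine(f, p) for p in prototypes]
        labels.append(sims.index(max(sims)))
    return labels
```

For example, with support features `[[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]`, a foreground mask `[1, 1, 0, 0]`, and its complement as background, passing the background and foreground prototypes (in that order) assigns label 1 to a query feature `[0.8, 0.2]` and label 0 to `[0.2, 0.8]`.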

Abstract:
Unsupervised object‑centric learning (OCL) decomposes visual scenes into distinct entities. Slot attention is a popular approach that represents individual objects as latent vectors, called slots. Current methods obtain these slot representations solely from the last layer of a pre‑trained vision transformer (ViT), ignoring valuable, semantically rich information encoded across the other layers. To better utilize this latent semantic information, we introduce MUFASA, a lightweight plug‑and‑play framework for slot attention‑based approaches to unsupervised object segmentation. Our model computes slot attention across multiple feature layers of the ViT encoder, fully leveraging their semantic richness. We propose a fusion strategy to aggregate slots obtained on multiple layers into a unified object‑centric representation. Integrating MUFASA into existing OCL methods improves their segmentation results across multiple datasets, setting a new state of the art while simultaneously improving training convergence with only minor inference overhead.

Abstract:
Vision Foundation Models (VFMs) have achieved remarkable success when applied to various downstream 2D tasks. Despite their effectiveness, they often exhibit a critical lack of 3D awareness. To this end, we introduce Splat and Distill, a framework that instills robust 3D awareness into 2D VFMs by augmenting the teacher model with a fast, feed‑forward 3D reconstruction pipeline. Given 2D features produced by a teacher model, our method first lifts these features into an explicit 3D Gaussian representation, in a feed‑forward manner. These 3D features are then "splatted" onto novel viewpoints, producing a set of novel 2D feature maps used to supervise the student model, "distilling" geometrically grounded knowledge. By replacing slow per‑scene optimization of prior work with our feed‑forward lifting approach, our framework avoids feature‑averaging artifacts, creating a dynamic learning process where the teacher's consistency improves alongside that of the student. We conduct a comprehensive evaluation on a suite of downstream tasks, including monocular depth estimation, surface normal estimation, multi‑view correspondence, and semantic segmentation. Our method significantly outperforms prior works, not only achieving substantial gains in 3D awareness but also enhancing the underlying semantic richness of 2D features. Project page is available at https://davidshavin4.github.io/Splat‑and‑Distill/

Abstract:
Long‑tailed class imbalance remains a fundamental obstacle in semantic segmentation of high‑resolution remote‑sensing imagery, where dominant classes shape learned representations and rare classes are systematically under‑segmented. This challenge becomes more acute in cross‑domain settings such as LoveDA, which exhibits an explicit Urban/Rural split with substantial appearance differences and inconsistent class‑frequency statistics across domains. We propose a prompt‑controlled diffusion augmentation framework that generates paired label‑image samples with explicit control over semantic composition and domain, enabling targeted enrichment of underrepresented classes rather than indiscriminate dataset expansion. A domain‑aware, masked, ratio‑conditioned discrete diffusion model first synthesizes layouts that satisfy class‑ratio targets while preserving realistic spatial co‑occurrence, and a ControlNet‑guided diffusion model then renders photorealistic, domain‑consistent images from these layouts. When mixed with real data, the resulting synthetic pairs improve multiple segmentation backbones, especially on minority classes and under domain shift, showing that better downstream segmentation comes from adding the right samples in the right proportions. Source codes, pretrained models, and synthetic datasets are available at https://buddhi19.github.io/SyntheticGen.

Abstract:
Segmentation based on language has been a popular topic in computer vision. While recent advances in multimodal large language models (MLLMs) have endowed segmentation systems with reasoning capabilities, these efforts remain confined by the frozen internal knowledge of MLLMs, which limits their potential for real‑world scenarios that involve up‑to‑date information or domain‑specific concepts. In this work, we propose Seg‑ReSearch, a novel segmentation paradigm that overcomes the knowledge bottleneck of existing approaches. By enabling interleaved reasoning and external search, Seg‑ReSearch empowers segmentation systems to handle dynamic, open‑world queries that extend beyond the frozen knowledge of MLLMs. To effectively train this capability, we introduce a hierarchical reward design that harmonizes initial guidance with progressive incentives, mitigating the dilemma between sparse outcome signals and rigid step‑wise supervision. For evaluation, we construct OK‑VOS, a challenging benchmark that explicitly requires outside knowledge for video object segmentation. Experiments on OK‑VOS and two existing reasoning segmentation benchmarks demonstrate that our Seg‑ReSearch improves state‑of‑the‑art approaches by a substantial margin. Code and data will be released at https://github.com/iSEE‑Laboratory/Seg‑ReSearch.

Abstract:
We present SeeingThroughClutter, a method for reconstructing structured 3D representations from single images by segmenting and modeling objects individually. Prior approaches rely on intermediate tasks such as semantic segmentation and depth estimation, which often underperform in complex scenes, particularly in the presence of occlusion and clutter. We address this by introducing an iterative object removal and reconstruction pipeline that decomposes complex scenes into a sequence of simpler subtasks. Using VLMs as orchestrators, foreground objects are removed one at a time via detection, segmentation, object removal, and 3D fitting. We show that removing objects allows for cleaner segmentations of subsequent objects, even in highly occluded scenes. Our method requires no task‑specific training and benefits directly from ongoing advances in foundation models. We demonstrate state‑of‑the‑art robustness on the 3D‑Front and ADE20K datasets. Project Page: https://rioak.github.io/seeingthroughclutter/

Abstract:
Semantic segmentation networks require large amounts of pixel‑level annotated data, which are costly to obtain for real‑world images. Computer graphics engines can generate synthetic images alongside their ground‑truth annotations. However, models trained on such images can perform poorly on real images due to the domain gap between real and synthetic images. Style transfer methods can reduce this difference by applying a realistic style to synthetic images. Choosing effective data transformations and their sequence is difficult due to the large combinatorial search space of style transfer operators. Using multi‑objective genetic algorithms, we optimize pipelines to balance structural coherence and style similarity to target domains. We study the use of paired‑image metrics on individual image samples during evolution to enable rapid pipeline evaluation, as opposed to standard distributional metrics that require the generation of many images. After optimization, we evaluate the resulting Pareto front using distributional metrics and segmentation performance. We apply this approach to standard datasets in synthetic‑to‑real domain adaptation: from the video game GTA5 to real image datasets Cityscapes and ACDC, focusing on adverse conditions. Results demonstrate that evolutionary algorithms can propose diverse augmentation pipelines adapted to different objectives. The contribution of this work is the formulation of style transfer as a sequencing problem suitable for evolutionary optimization and the study of efficient metrics that enable feasible search in this space. The source code is available at: https://github.com/echigot/MOOSS.

Abstract:
Referring Video Object Segmentation (RVOS) aims to segment objects in videos based on textual queries. Current methods mainly rely on large‑scale supervised fine‑tuning (SFT) of Multi‑modal Large Language Models (MLLMs). However, this paradigm suffers from heavy data dependence and limited scalability against the rapid evolution of MLLMs. Although recent zero‑shot approaches offer a flexible alternative, their performance remains significantly behind SFT‑based methods, due to their straightforward workflow designs. To address these limitations, we propose Refer‑Agent, a collaborative multi‑agent system with alternating reasoning‑reflection mechanisms. This system decomposes RVOS into a step‑by‑step reasoning process. During reasoning, we introduce a Coarse‑to‑Fine frame selection strategy to ensure frame diversity and textual relevance, along with a Dynamic Focus Layout that adaptively adjusts the agent's visual focus. Furthermore, we propose a Chain‑of‑Reflection mechanism, which employs a Questioner‑Responder pair to generate a self‑reflection chain, enabling the system to verify intermediate results and generate feedback for next‑round reasoning refinement. Extensive experiments on five challenging benchmarks demonstrate that Refer‑Agent significantly outperforms state‑of‑the‑art methods, including both SFT‑based models and zero‑shot approaches. Moreover, Refer‑Agent is flexible and enables fast integration of new MLLMs without any additional fine‑tuning costs. Code will be released at https://github.com/iSEE‑Laboratory/Refer‑Agent.

Abstract:
3D visual grounding (3DVG) aims to localize objects in a 3D scene based on natural language queries. In this work, we explore zero‑shot 3DVG from multi‑view images alone, without requiring any geometric supervision or object priors. We introduce Z3D, a universal grounding pipeline that flexibly operates on multi‑view images while optionally incorporating camera poses and depth maps. We identify key bottlenecks in prior zero‑shot methods causing significant performance degradation and address them with (i) a state‑of‑the‑art zero‑shot 3D instance segmentation method to generate high‑quality 3D bounding box proposals and (ii) advanced reasoning via prompt‑based segmentation, which utilizes full capabilities of modern VLMs. Extensive experiments on the ScanRefer and Nr3D benchmarks demonstrate that our approach achieves state‑of‑the‑art performance among zero‑shot methods. Code is available at https://github.com/col14m/z3d .

Abstract:
Semi‑supervised semantic segmentation (S4) can learn rich visual knowledge from low‑cost unlabeled images. However, traditional S4 architectures all face the challenge of low‑quality pseudo‑labels, especially in the teacher‑student framework. We propose a novel SemiEarth model that introduces vision‑language models (VLMs) to address the S4 issues for the remote sensing (RS) domain. Specifically, we design a VLM pseudo‑label purifying (VLM‑PP) structure to purify the teacher network's pseudo‑labels, achieving substantial improvements. Especially in multi‑class boundary regions of RS images, the VLM‑PP module can significantly improve the quality of pseudo‑labels generated by the teacher, thereby correctly guiding the student model's learning. Moreover, since VLM‑PP equips VLMs with open‑world capabilities and is independent of the S4 architecture, it can correct mispredicted categories in low‑confidence pseudo‑labels whenever a discrepancy arises between its prediction and the pseudo‑label. We conducted extensive experiments on multiple RS datasets, which demonstrate that our SemiEarth achieves SOTA performance. More importantly, unlike previous SOTA RS S4 methods, our model not only achieves excellent performance but also offers good interpretability. The code is released at https://github.com/wangshanwen001/SemiEarth.

Abstract:
Scene understanding with free‑form language has been widely explored within diverse modalities such as images, point clouds, and LiDAR. However, related studies on event sensors are scarce or narrowly centered on semantic‑level understanding. We introduce SEAL, the first Semantic‑aware Segment Any Events framework that addresses Open‑Vocabulary Event Instance Segmentation (OV‑EIS). Given a visual prompt, our model presents a unified framework to support both event segmentation and open‑vocabulary mask classification at multiple levels of granularity, including instance‑level and part‑level. To enable thorough evaluation on OV‑EIS, we curate four benchmarks that cover label granularity from coarse to fine class configurations and semantic granularity from instance‑level to part‑level understanding. Extensive experiments show that our SEAL largely outperforms proposed baselines in terms of performance and inference speed with a parameter‑efficient architecture. In the Appendix, we further present a simple variant of our SEAL achieving generic spatiotemporal OV‑EIS that does not require any visual prompts from users at inference. Check out our project page at https://0nandon.github.io/SEAL

Abstract:
High‑resolution remote sensing images contain densely distributed objects with pronounced scale variations and complex boundaries, which impose higher demands on both the geometric localization and semantic prediction capabilities of semantic segmentation models. Existing training‑free open‑vocabulary semantic segmentation (OVSS) methods typically fuse Contrastive Language‑Image Pretraining (CLIP) and vision foundation models (VFMs) using one‑way injection and shallow post‑processing strategies, making it difficult to satisfy these requirements. To address this issue, we propose a spatial‑regularization‑aware dual‑branch collaborative inference framework for training‑free OVSS, termed SDCI. First, during feature encoding, SDCI introduces a cross‑model attention fusion (CAF) module, which guides collaborative inference by injecting self‑attention maps into each other. Second, we propose a bidirectional cross‑graph diffusion refinement (BCDR) module that enhances the reliability of dual‑branch segmentation scores through iterative random‑walk diffusion. Finally, we incorporate low‑level superpixel structures and develop a convex‑optimization‑based superpixel collaborative prediction (CSCP) mechanism to further refine object boundaries. Experiments on multiple remote sensing semantic segmentation benchmarks demonstrate that our method achieves better performance than existing approaches. Our code is available at https://github.com/yu‑ni1989/SDCI.

Abstract:
Nuclei panoptic segmentation supports cancer diagnostics by integrating both semantic and instance segmentation of different cell types to analyze overall tissue structure and individual nuclei in histopathology images. Major challenges include detecting small objects, handling ambiguous boundaries, and addressing class imbalance. To address these issues, we propose PanopMamba, a novel hybrid encoder‑decoder architecture that integrates Mamba and Transformer with additional feature‑enhanced fusion via state space modeling. We design a multiscale Mamba backbone and a State Space Model (SSM)‑based fusion network to enable efficient long‑range perception in pyramid features, thereby extending the pure encoder‑decoder framework while facilitating information sharing across multiscale features of nuclei. The proposed SSM‑based feature‑enhanced fusion integrates pyramid feature networks and dynamic feature enhancement across different spatial scales, enhancing the feature representation of densely overlapping nuclei in both semantic and spatial dimensions. To the best of our knowledge, this is the first Mamba‑based approach for panoptic segmentation. Additionally, we introduce alternative evaluation metrics, including image‑level Panoptic Quality (iPQ), boundary‑weighted PQ (wPQ), and frequency‑weighted PQ (fwPQ), which are specifically designed to address the unique challenges of nuclei segmentation and thereby mitigate the potential bias inherent in vanilla PQ. Experimental evaluations on two multiclass nuclei segmentation benchmark datasets, MoNuSAC2020 and NuInsSeg, demonstrate the superiority of PanopMamba for nuclei panoptic segmentation over state‑of‑the‑art methods. Consequently, the robustness of PanopMamba is validated across various metrics, while the distinctiveness of PQ variants is also demonstrated. Code is available at https://github.com/mkang315/PanopMamba.

Abstract:
Accurate semantic segmentation of histopathology images is crucial for quantitative tissue analysis and downstream clinical modeling. Recent segmentation foundation models have improved generalization through large‑scale pretraining, yet remain poorly aligned with pathology because they treat segmentation as a static visual prediction task. Here we present VISTA‑PATH, an interactive, class‑aware pathology segmentation foundation model designed to resolve heterogeneous structures, incorporate expert feedback, and produce pixel‑level segmentations that are directly meaningful for clinical interpretation. VISTA‑PATH jointly conditions segmentation on visual context, semantic tissue descriptions, and optional expert‑provided spatial prompts, enabling precise multi‑class segmentation across heterogeneous pathology images. To support this paradigm, we curate VISTA‑PATH Data, a large‑scale pathology segmentation corpus comprising over 1.6 million image‑mask‑text triplets spanning 9 organs and 93 tissue classes. Across extensive held‑out and external benchmarks, VISTA‑PATH consistently outperforms existing segmentation foundation models. Importantly, VISTA‑PATH supports dynamic human‑in‑the‑loop refinement by propagating sparse, patch‑level bounding‑box annotation feedback into whole‑slide segmentation. Finally, we show that the high‑fidelity, class‑aware segmentation produced by VISTA‑PATH is a preferred model for computational pathology. It improves tissue microenvironment analysis through the proposed Tumor Interaction Score (TIS), which exhibits strong and significant associations with patient survival. Together, these results establish VISTA‑PATH as a foundation model that elevates pathology image segmentation from a static prediction to an interactive and clinically grounded representation for digital pathology. Source code and demo can be found at https://github.com/zhihuanglab/VISTA‑PATH.

Abstract:
Most pseudo‑label selection strategies in semi‑supervised learning rely on fixed confidence thresholds, implicitly assuming that prediction confidence reliably indicates correctness. In practice, deep networks are often overconfident: high‑confidence predictions can still be wrong, while informative low‑confidence samples near decision boundaries are discarded. This paper introduces a Confidence‑Variance (CoVar) theory framework that provides a principled joint reliability criterion for pseudo‑label selection. Starting from the entropy minimization principle, we derive a reliability measure that combines maximum confidence (MC) with residual‑class variance (RCV), which characterizes how probability mass is distributed over non‑maximum classes. The derivation shows that reliable pseudo‑labels should have both high MC and low RCV, and that the influence of RCV increases as confidence grows, thereby correcting overconfident but unstable predictions. From this perspective, we cast pseudo‑label selection as a spectral relaxation problem that maximizes separability in a confidence‑variance feature space, and design a threshold‑free selection mechanism to distinguish high‑ from low‑reliability predictions. We integrate CoVar as a plug‑in module into representative semi‑supervised semantic segmentation and image classification methods. Across PASCAL VOC 2012, Cityscapes, CIFAR‑10, and Mini‑ImageNet with varying label ratios and backbones, it consistently improves over strong baselines, indicating that combining confidence with residual‑class variance provides a more reliable basis for pseudo‑label selection than fixed confidence thresholds. (Code: https://github.com/ljs11528/CoVar_Pseudo_Label_Selection.git)
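
The two quantities that the CoVar abstract combines can be illustrated with a minimal sketch. It assumes that residual-class variance is the plain population variance of the non-maximum class probabilities; the paper's exact definition and the way it weights RCV against MC may differ. The function name is hypothetical.

```python
from statistics import pvariance


def confidence_and_residual_variance(probs):
    """Return (MC, RCV) for one softmax vector.

    MC  = maximum class probability.
    RCV = population variance of the remaining probabilities
          (an assumed definition, chosen for illustration only).
    """
    top = max(range(len(probs)), key=probs.__getitem__)
    residual = [p for i, p in enumerate(probs) if i != top]
    return probs[top], pvariance(residual)


# Two predictions with the same maximum confidence but different residuals.
spread_out = [0.7, 0.1, 0.1, 0.1]    # residual mass spread evenly
concentrated = [0.7, 0.3, 0.0, 0.0]  # residual mass on a single rival class

mc_a, rcv_a = confidence_and_residual_variance(spread_out)
mc_b, rcv_b = confidence_and_residual_variance(concentrated)
```

A fixed confidence threshold cannot tell these two predictions apart, since both have MC = 0.7. Under the abstract's criterion of high MC and low RCV, the first would count as the more reliable pseudo-label.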

Abstract:
As hubs of human activity, urban surfaces consist of a wealth of semantic entities. Segmenting these various entities from satellite imagery is crucial for a range of downstream applications. Current advanced segmentation models can reliably segment entities defined by physical attributes (e.g., buildings, water bodies) but still struggle with socially defined categories (e.g., schools, parks). In this work, we achieve socio‑semantic segmentation by vision‑language model reasoning. To facilitate this, we introduce the Urban Socio‑Semantic Segmentation dataset named SocioSeg, a new resource comprising satellite imagery, digital maps, and pixel‑level labels of social semantic entities organized in a hierarchical structure. Additionally, we propose a novel vision‑language reasoning framework called SocioReasoner that simulates the human process of identifying and annotating social semantic entities via cross‑modal recognition and multi‑stage reasoning. We employ reinforcement learning to optimize this non‑differentiable process and elicit the reasoning capabilities of the vision‑language model. Experiments demonstrate our approach's gains over state‑of‑the‑art models and strong zero‑shot generalization. The dataset and code are open‑sourced under the Apache License 2.0 at https://github.com/AMAP‑ML/SocioReasoner.

Abstract:
Segment Anything 3 (SAM3) has established a powerful foundation that robustly detects, segments, and tracks specified targets in videos. However, in its original implementation, its group‑level collective memory selection is suboptimal for complex multi‑object scenarios, as it employs a synchronized decision across all concurrent targets conditioned on their average performance, often overlooking individual reliability. To this end, we propose SAM3‑DMS, a training‑free decoupled strategy that utilizes fine‑grained memory selection on individual objects. Experiments demonstrate that our approach achieves robust identity preservation and tracking stability. Notably, our advantage becomes more pronounced with increased target density, establishing a solid foundation for simultaneous multi‑target video segmentation in the wild.
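
The difference between a synchronized group-level decision and a decoupled per-object decision can be sketched as follows. This is an illustration only: the per-object reliability scores, the threshold, and both function names are hypothetical and are not taken from SAM3 or SAM3-DMS.

```python
from statistics import mean


def group_level_select(scores, threshold=0.6):
    """One shared decision for all objects, based on their average score."""
    keep = mean(scores.values()) >= threshold
    return {obj: keep for obj in scores}


def decoupled_select(scores, threshold=0.6):
    """An independent decision per object, based on its own score."""
    return {obj: score >= threshold for obj, score in scores.items()}


# Hypothetical per-object reliability scores for one frame.
frame_scores = {"obj_1": 0.95, "obj_2": 0.90, "obj_3": 0.20}

group = group_level_select(frame_scores)
decoupled = decoupled_select(frame_scores)
```

With these scores the average (about 0.68) passes the threshold, so the group-level rule would also write the unreliable `obj_3` to memory, whereas the decoupled rule keeps only `obj_1` and `obj_2`.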

Abstract:
Purpose: Myocardium segmentation in echocardiography videos is a challenging task due to low contrast, noise, and anatomical variability. Traditional deep learning models either process frames independently, ignoring temporal information, or rely on memory‑based feature propagation, which accumulates error over time. Methods: We propose Point‑Seg, a transformer‑based segmentation framework that integrates point tracking as a temporal cue to ensure stable and consistent segmentation of myocardium across frames. Our method leverages a point‑tracking module trained on a synthetic echocardiography dataset to track key anatomical landmarks across video sequences. These tracked trajectories provide an explicit motion‑aware signal that guides segmentation, reducing drift and eliminating the need for memory‑based feature accumulation. Additionally, we incorporate a temporal smoothing loss to further enhance temporal consistency across frames. Results: We evaluate our approach on both public and private echocardiography datasets. Experimental results demonstrate that Point‑Seg has statistically similar accuracy in terms of Dice to state‑of‑the‑art segmentation models in high quality echo data, while it achieves better segmentation accuracy in lower quality echo with improved temporal stability. Furthermore, Point‑Seg has the key advantage of pixel‑level myocardium motion information as opposed to other segmentation methods. Such information is essential in the computation of other downstream tasks such as myocardial strain measurement and regional wall motion abnormality detection. Conclusion: Point‑Seg demonstrates that point tracking can serve as an effective temporal cue for consistent video segmentation, offering a reliable and generalizable approach for myocardium segmentation in echocardiography videos. The code is available at https://github.com/DeepRCL/PointSeg.

Abstract:
Accurately localizing and segmenting relevant objects from optical remote sensing images (ORSIs) is critical for advancing remote sensing applications. Existing methods are typically built upon moderate‑scale pre‑trained models and employ diverse optimization strategies to achieve promising performance under full‑parameter fine‑tuning. In fact, deeper and larger‑scale foundation models can provide stronger support for performance improvement. However, due to their massive number of parameters, directly adopting full‑parameter fine‑tuning leads to pronounced training difficulties, such as excessive GPU memory consumption and high computational costs, which result in extremely limited exploration of large‑scale models in existing works. In this paper, we propose a novel dynamic wavelet expert‑guided fine‑tuning paradigm with fewer trainable parameters, dubbed WEFT, which efficiently adapts large‑scale foundation models to ORSIs segmentation tasks by leveraging the guidance of wavelet experts. Specifically, we introduce a task‑specific wavelet expert extractor to model wavelet experts from different perspectives and dynamically regulate their outputs, thereby generating trainable features enriched with task‑specific information for subsequent fine‑tuning. Furthermore, we construct an expert‑guided conditional adapter that first enhances the fine‑grained perception of frozen features for specific tasks by injecting trainable features, and then iteratively updates the information of both types of feature, allowing for efficient fine‑tuning. Extensive experiments show that our WEFT not only outperforms 21 state‑of‑the‑art (SOTA) methods on three ORSIs datasets, but also achieves optimal results in camouflage, natural, and medical scenarios. The source code is available at: https://github.com/CSYSI/WEFT.

Abstract:
Segment Anything (SAM) provides an unprecedented foundation for human segmentation, but may struggle under occlusion, where keypoints may be partially or fully invisible. We adapt SAM 2.1 for pose‑guided segmentation with minimal encoder modifications, retaining its strong generalization. Using a fine‑tuning strategy called PoseMaskRefine, we incorporate pose keypoints with high visibility into the iterative correction process originally employed by SAM, yielding improved robustness and accuracy across multiple datasets. During inference, we simplify prompting by selecting only the three keypoints with the highest visibility. This strategy reduces sensitivity to common errors, such as missing body parts or misclassified clothing, and allows accurate mask prediction from as few as a single keypoint. Our results demonstrate that pose‑guided fine‑tuning of SAM enables effective, occlusion‑aware human segmentation while preserving the generalization capabilities of the original model. The code and pretrained models will be available at https://mirapurkrabek.github.io/BBox‑Mask‑Pose/.
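
The inference-time prompt selection described above (keep only the three most visible keypoints) can be sketched in a few lines. The keypoint tuple layout and the function name are assumptions made for illustration, not the authors' interface.

```python
def select_prompt_keypoints(keypoints, k=3):
    """Return the k keypoints with the highest visibility score.

    keypoints: list of (name, x, y, visibility) tuples (assumed layout).
    """
    ranked = sorted(keypoints, key=lambda kp: kp[3], reverse=True)
    return ranked[:k]


# Hypothetical pose estimate for one person.
pose = [
    ("nose", 120, 40, 0.9),
    ("left_wrist", 80, 150, 0.2),
    ("right_hip", 130, 200, 0.7),
    ("left_knee", 110, 280, 0.8),
    ("right_ankle", 135, 360, 0.1),
]

prompts = select_prompt_keypoints(pose)
prompt_names = [name for name, _, _, _ in prompts]
```

Passing `k=1` gives the single-keypoint setting that the abstract mentions.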

Abstract:
Video object segmentation methods like SAM2 achieve strong performance through memory‑based architectures but struggle under large viewpoint changes due to reliance on appearance features. Traditional 3D instance segmentation methods address viewpoint consistency but require camera poses, depth maps, and expensive preprocessing. We introduce 3AM, a training‑time enhancement that integrates 3D‑aware features from MUSt3R into SAM2. Our lightweight Feature Merger fuses multi‑level MUSt3R features that encode implicit geometric correspondence. Combined with SAM2's appearance features, the model achieves geometry‑consistent recognition grounded in both spatial position and visual similarity. We propose a field‑of‑view aware sampling strategy ensuring frames observe spatially consistent object regions for reliable 3D correspondence learning. Critically, our method requires only RGB input at inference, with no camera poses or preprocessing. On challenging datasets with wide‑baseline motion (ScanNet++, Replica), 3AM substantially outperforms SAM2 and extensions, achieving 90.6% IoU and 71.7% Tracking Recall on ScanNet++'s Selected Subset, improving over state‑of‑the‑art VOS methods by +15.9 and +30.4 points. Project page: https://jayisaking.github.io/3AM‑Page/

Abstract:
Diagnosing dental diseases from radiographs is time‑consuming and challenging due to the subtle nature of diagnostic evidence. Existing methods, which rely on object detection models designed for natural images with more distinct target patterns, struggle to detect dental diseases that present with far less visual support. To address this challenge, we propose DentalX, a novel context‑aware dental disease detection approach that leverages oral structure information to mitigate the visual ambiguity inherent in radiographs. Specifically, we introduce a structural context extraction module that learns an auxiliary task: semantic segmentation of dental anatomy. The module extracts meaningful structural context and integrates it into the primary disease detection task to enhance the detection of subtle dental diseases. Extensive experiments on a dedicated benchmark demonstrate that DentalX significantly outperforms prior methods in both tasks. This mutual benefit arises naturally during model optimization, as the correlation between the two tasks is effectively captured. Our code is available at https://github.com/zhiqin1998/DentYOLOX.

Abstract:
Vision modeling has advanced rapidly with Transformers, whose attention mechanisms capture visual dependencies but lack a principled account of how semantic information propagates spatially. We revisit this problem from a wave‑based perspective: feature maps are treated as spatial signals whose evolution over an internal propagation time (aligned with network depth) is governed by an underdamped wave equation. In this formulation, spatial frequency, from low‑frequency global layout to high‑frequency edges and textures, is modeled explicitly, and its interaction with propagation time is controlled rather than implicitly fixed. We derive a closed‑form, frequency‑time decoupled solution and implement it as the Wave Propagation Operator (WPO), a lightweight module that models global interactions in O(N log N) time, far lower than attention. Building on WPO, we propose a family of WaveFormer models as drop‑in replacements for standard ViTs and CNNs, achieving competitive accuracy across image classification, object detection, and semantic segmentation, while delivering up to 1.6x higher throughput and 30% fewer FLOPs than attention‑based alternatives. Furthermore, our results demonstrate that wave propagation introduces a complementary modeling bias to heat‑based methods, effectively capturing both global coherence and high‑frequency details essential for rich visual semantics. Code is available at: https://github.com/ZishanShu/WaveFormer.
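
For orientation, the textbook form of a damped wave equation and its underdamped closed-form solution in the spatial-frequency domain is given below. This is a generic derivation under assumed notation (damping rate \alpha, wave speed c, spatial frequency k, initial state u_0 and initial velocity v_0); the paper's exact parameterization of the WPO may differ.

```latex
% Damped wave equation on a feature map u(x, t)
\partial_t^2 u + 2\alpha\,\partial_t u = c^2 \nabla^2 u

% Spatial Fourier transform: one independent ODE per frequency k
\partial_t^2 \hat{u}_k + 2\alpha\,\partial_t \hat{u}_k + c^2 \lVert k \rVert^2 \hat{u}_k = 0

% Underdamped regime (c\lVert k\rVert > \alpha): decaying oscillation
\omega_k = \sqrt{c^2 \lVert k \rVert^2 - \alpha^2}, \qquad
\hat{u}_k(t) = e^{-\alpha t}\left[ \hat{u}_k(0)\cos(\omega_k t)
  + \frac{\hat{v}_k(0) + \alpha\,\hat{u}_k(0)}{\omega_k}\,\sin(\omega_k t) \right]
```

Each frequency evolves independently of the others, so the solution at any propagation time costs one forward FFT, a per-frequency multiplication, and one inverse FFT. That is consistent with the O(N log N) cost quoted in the abstract.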

Abstract:
Semantic segmentation of 3D geospatial point clouds is fundamental to remote sensing applications, yet domain shifts caused by regional and acquisition‑related variations often degrade model performance. Although domain adaptation can mitigate such shifts, existing methods typically require access to source‑domain data, which is often infeasible due to privacy concerns and regulatory policies. To address this, we propose LoGo (Local‑Global Dual‑Consensus), a novel source‑free unsupervised domain adaptation (SFUDA) framework requiring only a pretrained model and unlabeled target data. At the local level, we introduce a class‑balanced prototype estimation module that ensures that robust feature prototypes can be generated even for sample‑scarce tail classes, effectively mitigating the feature collapse caused by long‑tailed distributions. At the global level, we introduce an optimal transport‑based global distribution alignment module that formulates pseudo‑label assignment as a global optimization problem, effectively correcting the over‑dominance of head classes inherent in local greedy assignments, and thereby preventing model predictions from being severely biased towards majority classes. Finally, we propose a dual‑consistency pseudo‑label filtering mechanism that retains only high‑confidence pseudo‑labels where local multi‑augmented ensemble predictions align with global optimal transport assignments for self‑training. Extensive experiments on two challenging benchmarks, encompassing cross‑scene and cross‑sensor settings, demonstrate that LoGo consistently outperforms existing state‑of‑the‑art methods. The source code is available at https://github.com/GYproject/LoGo‑SFUDA.
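
The contrast between local greedy assignment and a globally constrained assignment can be sketched with Sinkhorn-style scaling, a common way to solve entropy-regularized optimal transport. Whether LoGo uses Sinkhorn, which class prior it targets, and the regularization strength `eps` are all assumptions here; the function names are hypothetical, and probabilities are assumed strictly positive.

```python
def greedy_assign(probs):
    """Local greedy pseudo-labels: per-sample argmax."""
    return [max(range(len(row)), key=row.__getitem__) for row in probs]


def balanced_assign(probs, class_prior, eps=0.5, iters=100):
    """Pseudo-labels from a transport plan whose class marginals match class_prior.

    Alternately rescales columns (class mass) and rows (one unit of mass
    per sample, normalized to 1/n), as in Sinkhorn iterations.
    """
    n, k = len(probs), len(class_prior)
    plan = [[p ** (1.0 / eps) for p in row] for row in probs]
    for _ in range(iters):
        col_sums = [sum(plan[i][j] for i in range(n)) for j in range(k)]
        plan = [[plan[i][j] * class_prior[j] / col_sums[j] for j in range(k)]
                for i in range(n)]
        plan = [[v / (n * sum(row)) for v in row] for row in plan]
    return greedy_assign(plan)


# Four samples, two classes; class 0 is the dominant "head" class.
probs = [[0.90, 0.10], [0.80, 0.20], [0.60, 0.40], [0.55, 0.45]]

greedy = greedy_assign(probs)
balanced = balanced_assign(probs, class_prior=[0.5, 0.5])
```

In this toy case the greedy rule labels every sample as class 0, while the balanced plan reassigns the two least confident samples to class 1. A dual-consistency filter of the kind the abstract describes would then keep only samples on which both assignments agree.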

Abstract:
Remote sensing change detection fundamentally relies on the effective fusion and discrimination of bi‑temporal features. Prevailing paradigms typically utilize Siamese encoders bridged by explicit difference computation modules, such as subtraction or concatenation, to identify changes. In this work, we challenge this complexity with SEED (Siamese Encoder‑Exchange‑Decoder), a streamlined paradigm that replaces explicit differencing with parameter‑free feature exchange. By sharing weights across both Siamese encoders and decoders, SEED effectively operates as a single‑parameter‑set model. Theoretically, we formalize feature exchange as an orthogonal permutation operator and prove that, under pixel consistency, this mechanism preserves mutual information and Bayes optimal risk, whereas common arithmetic fusion methods often introduce information loss. Extensive experiments across five benchmarks, including SYSU‑CD, LEVIR‑CD, PX‑CLCD, WaterCD, and CDD, and three backbones, namely SwinT, EfficientNet, and ResNet, demonstrate that SEED matches or surpasses state‑of‑the‑art methods despite its simplicity. Furthermore, we reveal that standard semantic segmentation models can be transformed into competitive change detectors solely by inserting this exchange mechanism, referred to as SEG2CD. The proposed paradigm offers a robust, unified, and interpretable framework for change detection, demonstrating that simple feature exchange is sufficient for high‑performance information fusion. Code and full training and evaluation protocols will be released at https://github.com/dyzy41/open‑rscd.
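
A parameter-free feature exchange can be sketched as a channel swap between the two temporal branches. The particular pattern used here (swapping every odd-indexed channel) is an assumption for illustration; the abstract does not specify which positions SEED exchanges. Strings stand in for per-channel feature maps.

```python
def exchange(feat_a, feat_b):
    """Swap odd-indexed channels between two equally sized channel lists."""
    out_a, out_b = [], []
    for c, (a, b) in enumerate(zip(feat_a, feat_b)):
        if c % 2 == 1:
            out_a.append(b)
            out_b.append(a)
        else:
            out_a.append(a)
            out_b.append(b)
    return out_a, out_b


# Placeholder channels for the two acquisition dates.
t1 = ["a0", "a1", "a2", "a3"]
t2 = ["b0", "b1", "b2", "b3"]

mixed_1, mixed_2 = exchange(t1, t2)
restored_1, restored_2 = exchange(mixed_1, mixed_2)
```

Because the swap only permutes entries, applying it twice restores the inputs exactly and nothing is discarded. A subtraction such as `a - b` cannot be inverted in this way, which is the intuition behind the information-preservation claim in the abstract.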

Abstract:
Food image segmentation is a critical task for dietary analysis, enabling accurate estimation of food volume and nutrients. However, current methods suffer from limited multi‑view data and poor generalization to new viewpoints. We introduce BenchSeg, a novel multi‑view food video segmentation dataset and benchmark. BenchSeg aggregates 55 dish scenes (from Nutrition5k, Vegetables & Fruits, MetaFood3D, and FoodKit) with 25,284 meticulously annotated frames, capturing each dish under free 360° camera motion. We evaluate a diverse set of 20 state‑of‑the‑art segmentation models (e.g., SAM‑based, transformer, CNN, and large multimodal) on the existing FoodSeg103 dataset and evaluate them (alone and combined with video‑memory modules) on BenchSeg. Quantitative and qualitative results demonstrate that while standard image segmenters degrade sharply under novel viewpoints, memory‑augmented methods maintain temporal consistency across frames. Our best model based on a combination of SeTR‑MLA+XMem2 outperforms prior work (e.g., improving over FoodMem by ~2.63% mAP), offering new insights into food segmentation and tracking for dietary analysis. In addition to frame‑wise spatial accuracy, we introduce a dedicated temporal evaluation protocol that explicitly quantifies segmentation stability over time through continuity, flicker rate, and IoU drift metrics. This allows us to reveal failure modes that remain invisible under standard per‑frame evaluations. We release BenchSeg to foster future research. The project page including the dataset annotations and the food segmentation models can be found at https://amughrabi.github.io/benchseg.

Abstract:
Existing image foundation models are not optimized for spherical images, having been trained primarily on perspective images. PanoSAMic integrates the pre‑trained Segment Anything (SAM) encoder into a semantic segmentation model for panoramic images with multiple modalities, making use of its extensive training. We modify the SAM encoder to output multi‑stage features and introduce a novel spatio‑modal fusion module that allows the model to select the relevant modalities and best features from each modality for different areas of the input. Furthermore, our semantic decoder uses spherical attention and dual view fusion to overcome the distortions and edge discontinuity often associated with panoramic images. PanoSAMic achieves state‑of‑the‑art (SotA) results on Stanford2D3DS for RGB, RGB‑D, and RGB‑D‑N modalities and on Matterport3D for RGB and RGB‑D modalities. Code is available at https://github.com/dfki‑av/PanoSAMic

Abstract:
Crop mapping based on satellite image time series (SITS) holds substantial economic value in agricultural production settings, in which parcel segmentation is an essential step. Existing approaches have achieved notable advancements in SITS segmentation with predetermined sequence lengths. However, we found that these approaches overlook the generalization capability of models across scenarios with varying temporal lengths, leading to markedly poor segmentation results in such cases. To address this issue, we propose TEA, a TEmporal Adaptive SITS semantic segmentation method that enhances the model's resilience under varying sequence lengths. We introduce a teacher model that encapsulates the global sequence knowledge to guide a student model with adaptive temporal input lengths. Specifically, the teacher shapes the student's feature space from the intermediate embedding, prototype, and soft label perspectives to realize knowledge transfer, while dynamically aggregating the student model to mitigate knowledge forgetting. Finally, we introduce full‑sequence reconstruction as an auxiliary task to further enhance the quality of representations across inputs of varying temporal lengths. Through extensive experiments, we demonstrate that our method brings remarkable improvements across inputs of different temporal lengths on common benchmarks. Our code will be publicly available.

Abstract:
Precise and scalable instance segmentation of cell nuclei is essential for computational pathology, yet gigapixel Whole‑Slide Images pose major computational challenges. Existing approaches rely on patch‑based processing and costly post‑processing for instance separation, sacrificing context and efficiency. We introduce LSP‑DETR (Local Star Polygon DEtection TRansformer), a fully end‑to‑end framework that uses a lightweight transformer with linear complexity to process substantially larger images without additional computational cost. Nuclei are represented as star‑convex polygons, and a novel radial distance loss function allows the segmentation of overlapping nuclei to emerge naturally, without requiring explicit overlap annotations or handcrafted post‑processing. Evaluations on PanNuke and MoNuSeg show strong generalization across tissues and state‑of‑the‑art efficiency, with LSP‑DETR being over five times faster than the next‑fastest leading method. Code and models are available at https://github.com/RationAI/lsp‑detr.
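
As a rough illustration of the star‑convex polygon representation mentioned above, the sketch below turns a center point and a list of radial distances into polygon vertices and measures the enclosed area. The function names, the evenly spaced ray angles, and the number of rays are assumptions made for illustration; the abstract does not specify LSP‑DETR's actual parameterization or its radial distance loss.

```python
import math


def star_polygon(center, radii):
    """Return one vertex per ray; rays are evenly spaced over a full turn
    (an assumed layout, not taken from the paper)."""
    cx, cy = center
    n = len(radii)
    return [
        (cx + r * math.cos(2 * math.pi * k / n),
         cy + r * math.sin(2 * math.pi * k / n))
        for k, r in enumerate(radii)
    ]


def polygon_area(vertices):
    """Shoelace formula for the area of a simple polygon."""
    total = 0.0
    # Pair each vertex with its successor, wrapping around at the end.
    for (x1, y1), (x2, y2) in zip(vertices, vertices[1:] + vertices[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0
```

Because every vertex is reached by a straight ray from the center, two such polygons can overlap freely, which is one plausible reading of why overlapping nuclei need no separate post‑processing step.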

Abstract:
Geo‑Foundation Models (GFMs) have proven effective in diverse downstream applications, including semantic segmentation, classification, and regression tasks. However, in the case of flood mapping using the Sen1Flood11 dataset as a downstream task, GFMs struggle to outperform the baseline U‑Net, highlighting the models' limitation in capturing critical local nuances. To address this, we present the Prithvi‑Complementary Adaptive Fusion Encoder (CAFE), which integrates the Prithvi GFM pretrained encoder with a parallel CNN residual branch enhanced by Convolutional Attention Modules (CAM). Prithvi‑CAFE enables fast and efficient fine‑tuning through adapters in Prithvi and performs multi‑scale, multi‑level fusion with CNN features, capturing critical local details while preserving long‑range dependencies. We achieve state‑of‑the‑art results on two comprehensive flood mapping datasets: Sen1Flood11 and FloodPlanet. On Sen1Flood11 test data, Prithvi‑CAFE (IoU 83.41) outperforms the original Prithvi (IoU 82.50) and other major GFMs (TerraMind 82.90, DOFA 81.54, SpectralGPT 81.02). The improvement is even more pronounced on the hold‑out test site, where Prithvi‑CAFE achieves an IoU of 81.37 compared to the baseline U‑Net (70.57) and original Prithvi (72.42). On FloodPlanet, Prithvi‑CAFE also surpasses the baseline U‑Net and other GFMs, achieving an IoU of 64.70 compared to U‑Net (60.14), TerraMind (62.33), DOFA (59.15) and Prithvi 2.0 (61.91). Our proposed simple yet effective Prithvi‑CAFE demonstrates strong potential for improving segmentation tasks where multi‑channel and multi‑modal data provide complementary information and local details are critical. The code is released at https://github.com/Sk‑2103/Prithvi‑CAFE.

Abstract:
Foundation segmentation models such as the Segment Anything Model (SAM) exhibit strong zero‑shot generalization through large‑scale pretraining, but adapting them to domain‑specific semantic segmentation remains challenging, particularly for thin structures (e.g., retinal vessels) and noisy modalities (e.g., SAR imagery). Full fine‑tuning is computationally expensive and risks catastrophic forgetting. We propose TopoLoRA‑SAM, a topology‑aware and parameter‑efficient adaptation framework for binary semantic segmentation. TopoLoRA‑SAM injects Low‑Rank Adaptation (LoRA) into the frozen ViT encoder, augmented with a lightweight spatial convolutional adapter and optional topology‑aware supervision via differentiable clDice. We evaluate our approach on five benchmarks spanning retinal vessel segmentation (DRIVE, STARE, CHASE_DB1), polyp segmentation (Kvasir‑SEG), and SAR sea/land segmentation (SL‑SSDD), comparing against U‑Net, DeepLabV3+, SegFormer, and Mask2Former. TopoLoRA‑SAM achieves the best retina‑average Dice and the best overall average Dice across datasets, while training only 5.2% of model parameters (~4.9M). On the challenging CHASE_DB1 dataset, our method substantially improves segmentation accuracy and robustness, demonstrating that topology‑aware parameter‑efficient adaptation can match or exceed fully fine‑tuned specialist models. Code is available at: https://github.com/salimkhazem/Seglab.git
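
To make the LoRA part of this abstract concrete, here is a minimal sketch of the standard low‑rank update, y = W x + (alpha / r) B (A x), with the pretrained weight W kept frozen. The helper names, the tiny matrices, and the `alpha` value are illustrative assumptions; which layers TopoLoRA‑SAM adapts and with what rank is not stated in the abstract.

```python
def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(w, a, b, x, alpha):
    """Frozen weight W (d_out x d_in) plus a scaled low-rank update.

    A has shape (r x d_in) and B has shape (d_out x r); only A and B
    would be trained.
    """
    r = len(a)  # rank = number of rows of A
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    return [y + (alpha / r) * u for y, u in zip(base, update)]


def lora_param_count(d_in, d_out, r):
    """Trainable parameters added by one LoRA pair (A and B)."""
    return r * (d_in + d_out)
```

With B initialized to zero, as is usual for LoRA, the adapted layer reproduces the frozen layer exactly at the start of training, and the trainable parameter count grows with the rank r rather than with d_in times d_out.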

Abstract:
Self‑supervised learning (SSL) methods have become a dominant paradigm for creating general purpose models whose capabilities can be transferred to downstream supervised learning tasks. However, most such methods rely on vast amounts of pretraining data. This work introduces Subimage Overlap Prediction, a novel self‑supervised pretraining task to aid semantic segmentation in remote sensing imagery that uses significantly less pretraining imagery. Given an image, a sub‑image is extracted and the model is trained to produce a semantic mask of the location of the extracted sub‑image within the original image. We demonstrate that pretraining with this task results in significantly faster convergence, and equal or better performance (measured via mIoU) on downstream segmentation. This gap in convergence and performance widens when labeled training data is reduced. We show this across multiple architecture types, and with multiple downstream datasets. We also show that our method matches or exceeds performance while requiring significantly less pretraining data relative to other SSL methods. Code and model weights are provided at https://github.com/sharmalakshay93/subimage‑overlap‑prediction.
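
The pretext task described above can be sketched as a data‑preparation step: crop a sub‑image, then build the binary mask that marks where the crop came from. The function names and the uniform random placement are assumptions for illustration; the abstract does not say how crop sizes or positions are chosen, nor whether the crop is resized or augmented before being fed to the model.

```python
import random


def extract_subimage(image, top, left, sub_h, sub_w):
    """Crop a sub_h x sub_w window from a 2D image given as nested lists."""
    return [row[left:left + sub_w] for row in image[top:top + sub_h]]


def overlap_target(height, width, top, left, sub_h, sub_w):
    """Binary mask with 1 where the sub-image lies inside the original image."""
    return [
        [1 if top <= y < top + sub_h and left <= x < left + sub_w else 0
         for x in range(width)]
        for y in range(height)
    ]


def sample_pair(image, sub_h, sub_w, rng):
    """Pick a random crop position (assumed uniform) and return (crop, mask)."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - sub_h)
    left = rng.randint(0, width - sub_w)
    crop = extract_subimage(image, top, left, sub_h, sub_w)
    return crop, overlap_target(height, width, top, left, sub_h, sub_w)
```

A model trained on such pairs receives the image and the crop as input and is supervised with the mask, so no human labels are needed.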

Abstract:
To address the scarcity of high‑quality part annotations in existing datasets, we introduce PartImageNet++ (PIN++), a dataset that provides detailed part annotations for all categories in ImageNet‑1K. With 100 annotated images per category, totaling 100K images, PIN++ represents the most comprehensive dataset covering a diverse range of object categories. Leveraging PIN++, we propose a Multi‑scale Part‑supervised recognition Model (MPM) for robust classification on ImageNet‑1K. We first trained a part segmentation network using PIN++ and used it to generate pseudo part labels for the remaining unannotated images. MPM then integrated a conventional recognition architecture with auxiliary bypass layers, jointly supervised by both pseudo part labels and the original part annotations. Furthermore, we conducted extensive experiments on PIN++, including part segmentation, object segmentation, and few‑shot learning, exploring various ways to leverage part annotations in downstream tasks. Experimental results demonstrated that our approach not only enhanced part‑based models for robust object recognition but also established strong baselines for multiple downstream tasks, highlighting the potential of part annotations in improving model performance. The dataset and the code are available at https://github.com/LixiaoTHU/PartImageNetPP.

Abstract:
Semantic segmentation is a fundamental problem in computer vision and it requires high‑resolution feature maps for dense prediction. Current coordinate‑guided low‑resolution feature interpolation methods, e.g., bilinear interpolation, produce coarse high‑resolution features which suffer from feature misalignment and insufficient context information. Moreover, enriching high‑resolution features with semantics incurs a high computational burden, making it challenging to meet the requirements of low‑latency inference. We propose a novel Guided Attentive Interpolation (GAI) method to adaptively interpolate fine‑grained high‑resolution features with semantic features to tackle these issues. Guided Attentive Interpolation determines both spatial and semantic relations of pixels from features of different resolutions and then leverages these relations to interpolate high‑resolution features with rich semantics. GAI can be integrated with any deep convolutional network for efficient semantic segmentation. In experiments, the GAI‑based semantic segmentation networks, i.e., GAIN, can achieve 78.8 mIoU with 22.3 FPS on Cityscapes and 80.6 mIoU with 64.5 FPS on CamVid using an NVIDIA 1080Ti GPU, which are the new state‑of‑the‑art results of low‑latency semantic segmentation. Code and models are available at: https://github.com/hustvl/simpleseg.

Abstract:
We introduce a novel FSVOS model that employs a local matching strategy to restrict the search space to the most relevant neighboring pixels. Rather than relying on inefficient standard im2col‑like implementations (e.g., spatial convolutions, depthwise convolutions and feature‑shifting mechanisms) or hardware‑specific CUDA kernels (e.g., deformable and neighborhood attention), which often suffer from limited portability across non‑CUDA devices, we reorganize the local sampling process through a direction‑based sampling perspective. Specifically, we implement a non‑parametric sampling mechanism that enables dynamically varying sampling regions. This approach provides the flexibility to adapt to diverse spatial structures without the computational costs of parametric layers and the need for model retraining. To further enhance feature coherence across frames, we design a supervised spatio‑temporal contrastive learning scheme that enforces consistency in feature representations. In addition, we introduce a publicly available benchmark dataset for multi‑object segmentation in X‑ray angiography videos (MOSXAV), featuring detailed, manually labeled segmentation ground truth. Extensive experiments on the CADICA, XACV, and MOSXAV datasets show that our proposed FSVOS method outperforms current state‑of‑the‑art video segmentation methods in terms of segmentation accuracy and generalization capability (i.e., seen and unseen categories). This work offers enhanced flexibility and potential for a wide range of clinical applications. Code is available at https://github.com/xilin‑x/XRAVOS

Abstract:
Semantic segmentation is crucial for medical image analysis, enabling precise disease diagnosis and treatment planning. However, many advanced models employ complex architectures, limiting their use in resource‑constrained clinical settings. This paper proposes MFEnNet, an efficient medical image segmentation framework that incorporates MetaFormer in the encoding phase of the U‑Net backbone. MetaFormer, an architectural abstraction of vision transformers, provides a versatile alternative to convolutional neural networks by transforming tokenized image patches into sequences for global context modeling. To mitigate the substantial computational cost associated with self‑attention, the proposed framework replaces conventional transformer modules with pooling transformer blocks, thereby achieving effective global feature aggregation at reduced complexity. In addition, Swish activation is used to achieve smoother gradients and faster convergence, while spatial pyramid pooling is incorporated at the bottleneck to improve multi‑scale feature extraction. Comprehensive experiments on different medical segmentation benchmarks demonstrate that the proposed MFEnNet approach attains competitive accuracy while significantly lowering computational cost compared to state‑of‑the‑art models. The source code for this work is available at https://github.com/tranleanh/mfennet.
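
One common way to replace self‑attention with pooling is the PoolFormer‑style token mixer, which outputs the local average minus the input itself. The sketch below shows that operation on a single‑channel feature map. It is an assumption that MFEnNet's pooling transformer blocks work this way; the function name, the 3x3 window, and the border handling (averaging only over in‑bounds neighbors) are likewise illustrative choices.

```python
def pool_mixer(feature_map, k=3):
    """Average-pool each position over a k x k window and subtract the input.

    Border positions average only over neighbors that fall inside the map.
    """
    h, w = len(feature_map), len(feature_map[0])
    half = k // 2
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [
                feature_map[j][i]
                for j in range(max(0, y - half), min(h, y + half + 1))
                for i in range(max(0, x - half), min(w, x + half + 1))
            ]
            row.append(sum(window) / len(window) - feature_map[y][x])
        out.append(row)
    return out
```

The mixer has no learnable parameters and its cost is linear in the number of positions, in contrast to the quadratic cost of pairwise self‑attention.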

Abstract:
Despite recent progress in 3D self‑supervised learning, collecting large‑scale 3D scene scans remains expensive and labor‑intensive. In this work, we investigate whether 3D representations can be learned from unlabeled videos recorded without any real 3D sensors. We present Laplacian‑Aware Multi‑level 3D Clustering with Sinkhorn‑Knopp (LAM3C), a self‑supervised framework that learns from video‑generated point clouds reconstructed from unlabeled videos. We first introduce RoomTours, a video‑generated point cloud dataset constructed by collecting room‑walkthrough videos from the web (e.g., real‑estate tours) and generating 49,219 scenes using an off‑the‑shelf feed‑forward reconstruction model. We also propose a noise‑regularized loss that stabilizes representation learning by enforcing local geometric smoothness and ensuring feature stability under noisy point clouds. Remarkably, without using any real 3D scans, LAM3C achieves better performance than previous self‑supervised methods on indoor semantic and instance segmentation. These results suggest that unlabeled videos represent an abundant source of data for 3D self‑supervised learning. Our source code is available at https://ryosuke‑yamada.github.io/lam3c/.

Abstract:
Semi‑supervised remote sensing (RS) image semantic segmentation offers a promising solution to alleviate the burden of exhaustive annotation, yet it fundamentally struggles with pseudo‑label drift, a phenomenon where confirmation bias leads to the accumulation of errors during training. In this work, we propose Co2S, a stable semi‑supervised RS segmentation framework that synergistically fuses priors from vision‑language models and self‑supervised models. Specifically, we construct a heterogeneous dual‑student architecture comprising two distinct ViT‑based vision foundation models initialized with pretrained CLIP and DINOv3 to mitigate error accumulation and pseudo‑label drift. To effectively incorporate these distinct priors, an explicit‑implicit semantic co‑guidance mechanism is introduced that utilizes text embeddings and learnable queries to provide explicit and implicit class‑level guidance, respectively, thereby jointly enhancing semantic consistency. Furthermore, a global‑local feature collaborative fusion strategy is developed to effectively fuse the global contextual information captured by CLIP with the local details produced by DINOv3, enabling the model to generate highly precise segmentation results. Extensive experiments on six popular datasets demonstrate the superiority of the proposed method, which consistently achieves leading performance across various partition protocols and diverse scenarios. Project page is available at https://xavierjiezou.github.io/Co2S/.

Abstract:
Real‑time instance segmentation for spinal endoscopy is important for identifying and protecting critical anatomy during surgery, but it is difficult because of the narrow field of view, specular highlights, smoke/bleeding, unclear boundaries, and large scale changes. Deployment is also constrained by limited surgical hardware, so the model must balance accuracy and speed and remain stable under small‑batch (even batch‑1) training. We propose LMSF‑A, a lightweight multi‑scale attention framework co‑designed across backbone, neck, and head. The backbone uses a C2f‑Pro module that combines RepViT‑style re‑parameterized convolution (RVB) with efficient multi‑scale attention (EMA), enabling multi‑branch training while collapsing into a single fast path for inference. The neck improves cross‑scale consistency and boundary detail using Scale‑Sequence Feature Fusion (SSFF) and Triple Feature Encoding (TFE), which strengthens high‑resolution features. The head adopts a Lightweight Multi‑task Shared Head (LMSH) with shared convolutions and GroupNorm to reduce parameters and support batch‑1 stability. We also release the clinically reviewed PELD dataset (61 patients, 610 images) with instance masks for adipose tissue, bone, ligamentum flavum, and nerve. Experiments show that LMSF‑A is highly competitive (or even better) on all evaluation metrics and much lighter than most instance segmentation methods, requiring only 1.8M parameters and 8.8 GFLOPs, and it generalizes well to a public teeth benchmark. Code and dataset: https://github.com/hhwmortal/PELD‑Instance‑segmentation.

Abstract:
Polarization‑based underwater 3D imaging exploits polarization cues to suppress background scattering, exhibiting distinct advantages in turbid water. Although data‑driven polarization‑based underwater 3D reconstruction methods show great potential, existing public datasets lack sufficient diversity in scattering and observation conditions, hindering fair comparisons among different approaches, including single‑view and multi‑view polarization imaging methods. To address this limitation, we construct MuS‑Polar3D, a benchmark dataset comprising polarization images of 42 objects captured under seven quantitatively controlled scattering conditions and five viewpoints, together with high‑precision 3D models (±0.05 mm accuracy), normal maps, and foreground masks. The dataset supports multiple vision tasks, including normal estimation, object segmentation, descattering, and 3D reconstruction. Inspired by computational imaging, we further decouple underwater 3D reconstruction under scattering into a two‑stage pipeline, namely descattering followed by 3D reconstruction, from an imaging‑chain perspective. Extensive evaluations using multiple baseline methods under complex scattering conditions demonstrate the effectiveness of the proposed benchmark, achieving a best mean angular error of 15.49 degrees. To the best of our knowledge, MuS‑Polar3D is the first publicly available benchmark dataset for underwater polarization‑based 3D imaging under quantitatively controlled turbidity, enabling accurate reconstruction and fair algorithm evaluation under controllable scattering conditions. The dataset and code are publicly available at https://github.com/WangPuyun/MuS‑Polar3D.

Abstract:
High‑resolution remote sensing image semantic segmentation (HRSS) is a fundamental task in the field of Earth observation. However, it has long faced the challenges of high inter‑class similarity and large intra‑class variability. Existing approaches often struggle to effectively inject abstract yet strongly discriminative semantic knowledge into pixel‑level feature learning, leading to blurred boundaries and class confusion in complex scenes. To address these challenges, we propose a Bidirectional Co‑Refinement Framework for HRSS (BiCoR‑Seg). Specifically, we design a Heatmap‑driven Bidirectional Information Synergy Module (HBIS), which establishes a bidirectional information flow between feature maps and class embeddings by generating class‑level heatmaps. Based on HBIS, we further introduce a hierarchical supervision strategy, where the interpretable heatmaps generated by each HBIS module are directly utilized as low‑resolution segmentation predictions for supervision, thereby enhancing the discriminative capacity of shallow features. In addition, to further improve the discriminability of the embedding representations, we propose a cross‑layer class embedding Fisher Discriminative Loss to enforce intra‑class compactness and enlarge inter‑class separability. Extensive experiments on the LoveDA, Vaihingen, and Potsdam datasets demonstrate that BiCoR‑Seg achieves outstanding segmentation performance while offering stronger interpretability. The released code is available at https://github.com/ShiJinghao566/BiCoR‑Seg.

Abstract:
Locating and retrieving objects from scene‑level point clouds is a challenging problem with broad applications in robotics and augmented reality. This task is commonly formulated as open‑vocabulary 3D instance segmentation. Although recent methods demonstrate strong performance, they depend heavily on SAM and CLIP to generate and classify 3D instance masks from images accompanying the point cloud, leading to substantial computational overhead and slow processing that limit their deployment in real‑world settings. Open‑YOLO 3D alleviates this issue by using a real‑time 2D detector to classify class‑agnostic masks produced directly from the point cloud by a pretrained 3D segmenter, eliminating the need for SAM and CLIP and significantly reducing inference time. However, Open‑YOLO 3D often fails to generalize to object categories that appear infrequently in the 3D training data. In this paper, we propose a method that generates 3D instance masks for novel objects from RGB images guided by a 2D open‑vocabulary detector. Our approach inherits the 2D detector's ability to recognize novel objects while maintaining efficient classification, enabling fast and accurate retrieval of rare instances from open‑ended text queries. Our code will be made available at https://github.com/ndkhanh360/BoxOVIS.

Abstract:
While 3DGS has emerged as a high‑fidelity scene representation, encoding rich, general‑purpose features directly from its primitives remains under‑explored. We address this gap by introducing Chorus, a multi‑teacher pretraining framework that learns a holistic feed‑forward 3D Gaussian Splatting (3DGS) scene encoder by distilling complementary signals from 2D foundation models. Chorus employs a shared 3D encoder and teacher‑specific projectors to learn from language‑aligned, generalist, and object‑aware teachers, encouraging a shared embedding space that captures signals from high‑level semantics to fine‑grained structure. We evaluate Chorus on a wide range of tasks: open‑vocabulary semantic and instance segmentation, linear and decoder probing, data‑efficient supervision, as well as LLM‑based Q&A. Besides 3DGS, we also test Chorus on several benchmarks that only support point clouds by pretraining a variant using only Gaussian centers, colors, and estimated normals. Surprisingly, this encoder shows strong transfer and outperforms the point‑cloud baseline while using 39.9 times fewer training scenes. Finally, we propose a render‑and‑distill adaptation that facilitates out‑of‑domain finetuning.

Abstract:
At the most basic level, pixels are the source of the visual information through which we perceive the world. Pixels contain information at all levels, ranging from low‑level attributes to high‑level concepts. Autoencoders represent a classical and long‑standing paradigm for learning representations from pixels or other raw inputs. In this work, we demonstrate that autoencoder‑based self‑supervised learning remains competitive today and can produce strong representations for downstream tasks, while remaining simple, stable, and efficient. Our model, codenamed "Pixio", is an enhanced masked autoencoder (MAE) with more challenging pre‑training tasks and more capable architectures. The model is trained on 2B web‑crawled images with a self‑curation strategy with minimal human curation. Pixio performs competitively across a wide range of downstream tasks in the wild, including monocular depth estimation (e.g., Depth Anything), feed‑forward 3D reconstruction (i.e., MapAnything), semantic segmentation, and robot learning, outperforming or matching DINOv3 trained at similar scales. Our results suggest that pixel‑space self‑supervised learning can serve as a promising alternative and a complement to latent‑space approaches.

Abstract:
In this paper, we focus on online zero‑shot monocular 3D instance segmentation, a novel practical setting where existing approaches fail to perform because they rely on posed RGB‑D sequences. To overcome this limitation, we leverage CUT3R, a recent Reconstructive Foundation Model (RFM), to provide reliable geometric priors from a single RGB stream. We propose MoonSeg3R, which introduces three key components: (1) a self‑supervised query refinement module with spatial‑semantic distillation that transforms segmentation masks from 2D visual foundation models (VFMs) into discriminative 3D queries; (2) a 3D query index memory that provides temporal consistency by retrieving contextual queries; and (3) a state‑distribution token from CUT3R that acts as a mask identity descriptor to strengthen cross‑frame fusion. Experiments on ScanNet200 and SceneNN show that MoonSeg3R is the first method to enable online monocular 3D segmentation and achieves performance competitive with state‑of‑the‑art RGB‑D‑based systems. Our code is available at https://github.com/VICO‑UoE/MoonSeg3R.

Abstract:
In recent years, the state‑of‑the‑art in unsupervised video instance segmentation has heavily relied on synthetic video data, generated from object‑centric image datasets such as ImageNet. However, video synthesis by artificially shifting and scaling image instance masks fails to accurately model realistic motion in videos, such as perspective changes, movement by parts of one or multiple instances, or camera motion. To tackle this issue, we propose an unsupervised video instance segmentation model trained exclusively on real video data. We start from unsupervised instance segmentation masks on individual video frames. However, these single‑frame segmentations exhibit temporal noise and their quality varies through the video. Therefore, we establish temporal coherence by identifying high‑quality keymasks in the video by leveraging deep motion priors. The sparse keymask pseudo‑annotations are then used to train a segmentation model for implicit mask propagation, for which we propose a Sparse‑To‑Dense Distillation approach aided by a Temporal DropLoss. After training the final model on the resulting dense labelset, our approach outperforms the current state‑of‑the‑art across various benchmarks.

Abstract:
Predicting driver attention is a critical problem for developing explainable autonomous driving systems and understanding driver behavior in mixed human‑autonomous vehicle traffic scenarios. Although significant progress has been made through large‑scale driver attention datasets and deep learning architectures, existing works are constrained by narrow frontal field‑of‑view and limited driving diversity. Consequently, they fail to capture the full spatial context of driving environments, especially during lane changes, turns, and interactions involving peripheral objects such as pedestrians or cyclists. In this paper, we introduce DriverGaze360, a large‑scale 360° field‑of‑view driver attention dataset, containing ~1 million gaze‑labeled frames collected from 19 human drivers, enabling comprehensive omnidirectional modeling of driver gaze behavior. Moreover, our panoramic attention prediction approach, DriverGaze360‑Net, jointly learns attention maps and attended objects by employing an auxiliary semantic segmentation head. This improves spatial awareness and attention prediction across wide panoramic inputs. Extensive experiments demonstrate that DriverGaze360‑Net achieves state‑of‑the‑art attention prediction performance on multiple metrics on panoramic driving images. Dataset and method available at https://dfki‑av.github.io/drivergaze360.

Abstract:
Deep learning has advanced two fundamentally different paradigms for instance segmentation: specialized models optimized through task‑specific fine‑tuning and generalist foundation models capable of zero‑shot segmentation. This work presents a comprehensive comparison between SAM3 (Segment Anything Model, also called SAMv3) operating in zero‑shot mode and three variants of Ultralytics YOLO11 (nano, medium, and large) fine‑tuned for instance segmentation. The evaluation is conducted on the MinneApple dataset, a dense benchmark comprising 670 orchard images with 28,179 annotated apple instances, enabling rigorous validation of model behavior under high object density and occlusion. Our analysis shows that the choice of IoU threshold can inflate performance gaps by up to 30%. At the appropriate IoU = 0.15 threshold, YOLO models achieve 68.9%, 72.2%, and 71.9% F1, while SAM3 reaches 59.8% in pure zero‑shot mode. However, YOLO exhibits a steep degradation of 48‑50 points across IoU ranges, whereas SAM3 drops only 4 points, revealing 12 times better boundary stability for SAM3. This highlights the strength of SAMv3 in mask precision versus the specialization of YOLO11 in detection completeness. We provide open‑source code, evaluation pipelines, and methodological recommendations, contributing to a deeper understanding of when specialized fine‑tuned models or generalist foundation models are preferable for dense instance segmentation tasks. This project repository is available on GitHub as https://github.com/Applied‑AI‑Research‑Lab/Segment‑Anything‑Model‑SAM3‑Zero‑Shot‑Segmentation‑Against‑Fine‑Tuned‑YOLO‑Detectors
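
The dependence of F1 on the IoU threshold discussed above can be illustrated with a small sketch. Masks are represented as sets of (row, column) pixels, and predictions are matched to ground truth greedily, one to one, in order of decreasing IoU. The function names and the greedy matching rule are assumptions; the abstract does not describe the paper's exact matching protocol.

```python
def mask_iou(a, b):
    """IoU of two masks given as sets of (row, col) pixel coordinates."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def f1_at_iou(preds, gts, threshold):
    """F1 after greedy one-to-one matching of predictions to ground truth."""
    pairs = sorted(
        ((mask_iou(p, g), i, j)
         for i, p in enumerate(preds)
         for j, g in enumerate(gts)),
        reverse=True,
    )
    used_preds, used_gts = set(), set()
    true_pos = 0
    for iou, i, j in pairs:
        if iou < threshold:
            break  # remaining pairs have even lower IoU
        if i in used_preds or j in used_gts:
            continue
        used_preds.add(i)
        used_gts.add(j)
        true_pos += 1
    false_pos = len(preds) - true_pos
    false_neg = len(gts) - true_pos
    denom = 2 * true_pos + false_pos + false_neg
    return 2 * true_pos / denom if denom else 0.0
```

Evaluating the same predictions at several thresholds shows how a pair with moderate overlap counts as a hit at a loose threshold and as both a false positive and a false negative at a strict one.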

Abstract:
Change detection in remote sensing imagery is essential for applications such as urban planning, environmental monitoring, and disaster management. Traditional change detection methods typically identify all changes between two temporal images without distinguishing the types of transitions, which can lead to results that may not align with specific user needs. Although semantic change detection methods have attempted to address this by categorizing changes into predefined classes, these methods rely on rigid class definitions and fixed model architectures, making it difficult to mix datasets with different label sets or reuse models across tasks, as the output channels are tightly coupled with the number and type of semantic classes. To overcome these limitations, we introduce Referring Change Detection (RCD), which leverages natural language prompts to detect specific classes of changes in remote sensing images. By integrating language understanding with visual analysis, our approach allows users to specify the exact type of change they are interested in. However, training models for RCD is challenging due to the limited availability of annotated data and severe class imbalance in existing datasets. To address this, we propose a two‑stage framework consisting of (I) RCDNet, a cross‑modal fusion network designed for referring change detection, and (II) RCDGen, a diffusion‑based synthetic data generation pipeline that produces realistic post‑change images and change maps for a specified category using only the pre‑change image, without relying on semantic segmentation masks, thereby significantly lowering the barrier to scalable data creation. Experiments across multiple datasets show that our framework enables scalable and targeted change detection. Project website is here: https://yilmazkorkmaz1.github.io/RCD.

Abstract:
Weakly supervised semantic segmentation offers a label‑efficient solution to train segmentation models for volumetric medical imaging. However, existing approaches often rely on 2D encoders that neglect the inherent volumetric nature of the data. We propose TranSamba, a hybrid Transformer‑Mamba architecture designed to capture 3D context for weakly supervised volumetric medical segmentation. TranSamba augments a standard Vision Transformer backbone with Cross‑Plane Mamba blocks, which leverage the linear complexity of state space models for efficient information exchange across neighboring slices. The information exchange enhances the pairwise self‑attention within slices computed by the Transformer blocks, directly contributing to the attention maps for object localization. TranSamba achieves effective volumetric modeling with time complexity that scales linearly with the input volume depth and maintains constant memory usage for batch processing. Extensive experiments on three datasets demonstrate that TranSamba establishes new state‑of‑the‑art performance, consistently outperforming existing methods across diverse modalities and pathologies. Our source code and trained models are openly accessible at: https://github.com/YihengLyu/TranSamba.

Abstract:
Accurate segmentation of cardiac chambers in echocardiography sequences is crucial for the quantitative analysis of cardiac function, aiding in clinical diagnosis and treatment. Imaging noise, artifacts, and the deformation and motion of the heart pose challenges to segmentation algorithms. While existing methods based on convolutional neural networks, Transformers, and space‑time memory networks have improved segmentation accuracy, they often struggle with the trade‑off between capturing long‑range spatiotemporal dependencies and maintaining computational efficiency with fine‑grained feature representation. In this paper, we introduce GDKVM, a novel architecture for echocardiography video segmentation. The model employs Linear Key‑Value Association (LKVA) to effectively model inter‑frame correlations, and introduces a Gated Delta Rule (GDR) to efficiently store intermediate memory states. A Key‑Pixel Feature Fusion (KPFF) module is designed to integrate local and global features at multiple scales, enhancing robustness against boundary blurring and noise interference. We validated GDKVM on two mainstream echocardiography video datasets (CAMUS and EchoNet‑Dynamic) and compared it with various state‑of‑the‑art methods. Experimental results show that GDKVM outperforms existing approaches in terms of segmentation accuracy and robustness, while ensuring real‑time performance. Code is available at https://github.com/wangrui2025/GDKVM.

Abstract:
Most existing methods for training‑free open‑vocabulary semantic segmentation are based on CLIP. While these approaches have made progress, they often face challenges in precise localization or require complex pipelines to combine separate modules, especially in remote sensing scenarios where numerous dense and small targets are present. Recently, Segment Anything Model 3 (SAM 3) was proposed, unifying segmentation and recognition in a promptable framework. In this paper, we present a comprehensive exploration of applying SAM 3 to the remote sensing open‑vocabulary tasks (i.e., 2D semantic segmentation, change detection, and 3D semantic segmentation) without any training. First, we implement a mask fusion strategy that combines the outputs from SAM 3's semantic segmentation head and the Transformer decoder (instance head). This allows us to leverage the strengths of both heads for better land coverage. Second, we utilize the presence score from the presence head to filter out categories that do not exist in the scene, reducing false positives caused by the vast vocabulary sizes and patch‑level processing in geospatial scenes. Furthermore, we extend our method to open‑vocabulary change detection by a joint instance‑ and pixel‑level verification strategy built directly upon our fused logits. We evaluate our method on extensive remote sensing datasets and tasks, including 20 segmentation datasets, 3 change detection datasets, and a 3D segmentation dataset. Experiments show that our method achieves promising performance, demonstrating the potential of SAM 3 for remote sensing open‑vocabulary tasks. Our code is released at https://github.com/earth‑insights/SegEarth‑OV‑3.
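
The two steps described above (fusing the outputs of two heads, then filtering categories by a presence score) can be illustrated with a toy sketch. This is not the authors' implementation: the element-wise maximum used for fusion, the 0.5 threshold, and the dictionary-based data layout are all assumptions made here for illustration, and the function name `segment` is hypothetical.

```python
def segment(semantic, instance, presence, threshold=0.5):
    """Toy fusion + presence filtering over flat per-pixel score lists.

    semantic, instance: dict mapping category -> list of per-pixel scores.
    presence: dict mapping category -> image-level presence score.
    The max-fusion rule and the threshold are assumptions, not the paper's.
    """
    n_pixels = len(next(iter(semantic.values())))
    # Drop categories whose presence score says they are likely absent.
    kept = [c for c in semantic if presence[c] >= threshold]
    if not kept:
        return [None] * n_pixels
    # Fuse the two heads per pixel (assumed rule: element-wise maximum).
    fused = {c: [max(s, i) for s, i in zip(semantic[c], instance[c])]
             for c in kept}
    # Assign each pixel to the surviving category with the highest score.
    return [max(kept, key=lambda c: fused[c][px]) for px in range(n_pixels)]


semantic = {"building": [0.9, 0.2, 0.1],
            "water": [0.1, 0.3, 0.2],
            "airplane": [0.2, 0.6, 0.9]}
instance = {"building": [0.8, 0.7, 0.1],
            "water": [0.0, 0.1, 0.6],
            "airplane": [0.1, 0.1, 0.1]}
presence = {"building": 0.9, "water": 0.8, "airplane": 0.1}
labels = segment(semantic, instance, presence)
```

In this toy input, "airplane" has the highest raw score on the last pixel but is removed by the presence filter, so the pixel falls to "water" instead.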

Abstract:
Human Mesh Recovery (HMR) aims to reconstruct 3D human pose and shape from 2D observations and is fundamental to human‑centric understanding in real‑world scenarios. While recent image‑based HMR methods such as SAM 3D Body achieve strong robustness on in‑the‑wild images, they rely on per‑frame inference when applied to videos, leading to temporal inconsistency and degraded performance under occlusions. We address these issues without extra training by leveraging the inherent human continuity in videos. We propose SAM‑Body4D, a training‑free framework for temporally consistent and occlusion‑robust HMR from videos. We first generate identity‑consistent masklets using a promptable video segmentation model, then refine them with an Occlusion‑Aware module to recover missing regions. The refined masklets guide SAM 3D Body to produce consistent full‑body mesh trajectories, while a padding‑based parallel strategy enables efficient multi‑human inference. Experimental results demonstrate that SAM‑Body4D achieves improved temporal stability and robustness in challenging in‑the‑wild videos, without any retraining. Our code and demo are available at: https://github.com/gaomingqi/sam‑body4d.

Abstract:
Referring expression segmentation is a fundamental task in computer vision that integrates natural language understanding with precise visual localization of target regions. Aerial imagery (e.g., modern aerial photos collected through drones, historical photos from aerial archives, high‑resolution satellite imagery, etc.) presents unique challenges because spatial resolution varies widely across datasets, the use of color is not consistent, targets often shrink to only a few pixels, and scenes contain very high object densities and partially occluded objects. This work presents Aerial‑D, a new large‑scale referring expression segmentation dataset for aerial imagery, comprising 37,288 images with 1,522,523 referring expressions that cover 259,709 annotated targets, spanning individual object instances, groups of instances, and semantic regions covering 21 distinct classes that range from vehicles and infrastructure to land coverage types. The dataset was constructed through a fully automatic pipeline that combines systematic rule‑based expression generation with a Large Language Model (LLM) enhancement procedure that enriched both the linguistic variety and the focus on visual details within the referring expressions. Filters were additionally used to simulate historic imaging conditions for each scene. We adopted the RSRefSeg architecture, and trained models on Aerial‑D together with prior aerial datasets, yielding unified instance and semantic segmentation from text for both modern and historical images. Results show that the combined training achieves competitive performance on contemporary benchmarks, while maintaining strong accuracy under monochrome, sepia, and grainy degradations that appear in archival aerial photography. The dataset, trained models, and complete software pipeline are publicly available at https://luispl77.github.io/aerial‑d .

Abstract:
Camouflage is primarily context‑dependent yet current metrics for camouflaged scenarios overlook this critical factor. Instead, these metrics are originally designed for evaluating general or salient objects, with an inherent assumption of uncorrelated spatial context. In this paper, we propose a new contextualized evaluation paradigm, Context‑measure, built upon a probabilistic pixel‑aware correlation framework. By incorporating spatial dependencies and pixel‑wise camouflage quantification, our measure better aligns with human perception. Extensive experiments across three challenging camouflaged object segmentation datasets show that Context‑measure delivers more reliability than existing context‑independent metrics. Our measure can provide a foundational evaluation benchmark for various computer vision applications involving camouflaged patterns, such as agricultural, industrial, and medical scenarios. Code is available at https://github.com/pursuitxi/Context‑measure.

Abstract:
Unsupervised domain adaptation (UDA) for semantic segmentation aims to transfer knowledge from a labeled source domain to an unlabeled target domain. Despite the effectiveness of self‑training techniques in UDA, they struggle to learn each class in a balanced manner due to inherent class imbalance and distribution shift in both data and label space between domains. To address this issue, we propose Balanced Learning for Domain Adaptation (BLDA), a novel approach to directly assess and alleviate class bias without requiring prior knowledge about the distribution shift. First, we identify over‑predicted and under‑predicted classes by analyzing the distribution of predicted logits. Subsequently, we introduce a post‑hoc approach to align the logits distributions across different classes using shared anchor distributions. To further consider the network's need to generate unbiased pseudo‑labels during self‑training, we estimate logits distributions online and incorporate logits correction terms into the loss function. Moreover, we leverage the resulting cumulative density as domain‑shared structural knowledge to connect the source and target domains. Extensive experiments on two standard UDA semantic segmentation benchmarks demonstrate that BLDA consistently improves performance, especially for under‑predicted classes, when integrated into various existing methods. Code is available at https://github.com/Woof6/BLDA.

Abstract:
Pseudo‑label learning is widely used in semantic segmentation, particularly in label‑scarce scenarios such as unsupervised domain adaptation (UDA) and semi‑supervised learning (SSL). Despite its success, this paradigm can generate erroneous pseudo‑labels, which are further amplified during training due to the use of one‑hot encoding. To address this issue, we propose ECOCSeg, a novel perspective for segmentation models that utilizes error‑correcting output codes (ECOC) to create a fine‑grained encoding for each class. ECOCSeg offers several advantages. First, an ECOC‑based classifier is introduced, enabling the model to disentangle classes into attributes and handle partially inaccurate bits, improving stability and generalization in pseudo‑label learning. Second, a bit‑level label denoising mechanism is developed to generate higher‑quality pseudo‑labels, providing adequate and robust supervision for unlabeled images. ECOCSeg can be easily integrated with existing methods and consistently demonstrates significant improvements on multiple UDA and SSL benchmarks across different segmentation architectures. Code is available at https://github.com/Woof6/ECOCSeg.
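
To make the idea of error-correcting output codes concrete, the following minimal sketch shows generic ECOC decoding: each class is assigned a binary codeword, the model predicts one probability per bit, and the predicted class is the codeword nearest in Hamming distance. The codebook, class names, and 0.5 threshold are hypothetical and are not taken from ECOCSeg.

```python
# Hypothetical 6-bit codebook with minimum pairwise Hamming distance 3,
# so any single flipped bit can still be decoded to the right class.
CODEBOOK = {
    "road":     (0, 0, 0, 0, 0, 0),
    "sidewalk": (1, 1, 1, 0, 0, 0),
    "building": (0, 0, 0, 1, 1, 1),
    "sky":      (1, 1, 1, 1, 1, 1),
}


def hamming(a, b):
    """Number of positions where two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def decode(bit_probs, codebook=CODEBOOK):
    """Threshold per-bit probabilities and return the nearest codeword's class."""
    bits = tuple(1 if p >= 0.5 else 0 for p in bit_probs)
    return min(codebook, key=lambda name: hamming(bits, codebook[name]))


# The second bit is wrong relative to "sidewalk", yet decoding recovers it.
predicted = decode([0.9, 0.2, 0.8, 0.1, 0.3, 0.4])
```

With one-hot labels a single wrong output changes the class outright; with a codebook like this one, a single erroneous bit is absorbed by the distance between codewords.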

Abstract:
Video Instance Segmentation (VIS) faces significant annotation challenges due to its dual requirements of pixel‑level masks and temporal consistency labels. While recent unsupervised methods like VideoCutLER eliminate optical flow dependencies through synthetic data, they remain constrained by the synthetic‑to‑real domain gap. We present AutoQ‑VIS, a novel unsupervised framework that bridges this gap through quality‑guided self‑training. Our approach establishes a closed‑loop system between pseudo‑label generation and automatic quality assessment, enabling progressive adaptation from synthetic to real videos. Experiments demonstrate state‑of‑the‑art performance with 52.6 $\text{AP}_{50}$ on the YouTubeVIS‑2019 \texttt{val} set, surpassing the previous state‑of‑the‑art VideoCutLER by 4.4%, while requiring no human annotations. This demonstrates the viability of quality‑aware self‑training for unsupervised VIS. We will release the code at https://github.com/wcbup/AutoQ‑VIS.

Abstract:
Cross‑Domain Few‑Shot Semantic Segmentation (CD‑FSS) seeks to segment unknown classes in unseen domains using only a few annotated examples. This setting is inherently challenging: source and target domains exhibit substantial distribution shifts, label spaces are disjoint, and support images are scarce, making standard episodic methods unreliable and computationally demanding at test time. To address these constraints, we propose DistillFSS, a framework that embeds support‑set knowledge directly into a model's parameters through a teacher‑student distillation process. By internalizing few‑shot reasoning into a dedicated layer within the student network, DistillFSS eliminates the need for support images at test time, enabling fast, lightweight inference, while allowing efficient extension to novel classes in unseen domains through rapid teacher‑driven specialization. Combined with fine‑tuning, the approach scales efficiently to large support sets and significantly reduces computational overhead. To evaluate the framework under realistic conditions, we introduce a new CD‑FSS benchmark spanning medical imaging, industrial inspection, and remote sensing, with disjoint label spaces and variable support sizes. Experiments show that DistillFSS matches or surpasses state‑of‑the‑art baselines, particularly in multi‑class and multi‑shot scenarios, while offering substantial efficiency gains. The code is available at https://github.com/pasqualedem/DistillFSS.

Abstract:
Recent advancements in foundation models for 2D vision have substantially improved the analysis of dynamic scenes from monocular videos. However, despite their strong generalization capabilities, these models often lack 3D consistency, a fundamental requirement for understanding scene geometry and motion, thereby causing severe spatial misalignment and temporal flickering in complex 3D environments. In this paper, we present Motion4D, a novel framework that addresses these challenges by integrating 2D priors from foundation models into a unified 4D Gaussian Splatting representation. Our method features a two‑part iterative optimization framework: 1) Sequential optimization, which updates motion and semantic fields in consecutive stages to maintain local consistency, and 2) Global optimization, which jointly refines all attributes for long‑term coherence. To enhance motion accuracy, we introduce a 3D confidence map that dynamically adjusts the motion priors, and an adaptive resampling process that inserts new Gaussians into under‑represented regions based on per‑pixel RGB and semantic errors. Furthermore, we enhance semantic coherence through an iterative refinement process that resolves semantic inconsistencies by alternately optimizing the semantic fields and updating prompts of SAM2. Extensive evaluations demonstrate that our Motion4D significantly outperforms both 2D foundation models and existing 3D‑based approaches across diverse scene understanding tasks, including point‑based tracking, video object segmentation, and novel view synthesis. Our code is available at https://hrzhou2.github.io/motion4d‑web/.

Abstract:
Recent domain generalized semantic segmentation (DGSS) studies have achieved notable improvements by distilling semantic knowledge from Vision‑Language Models (VLMs). However, they overlook the semantic misalignment between visual and textual contexts, which arises due to the rigidity of a fixed context prompt learned on a single source domain. To this end, we present a novel domain generalization framework for semantic segmentation, namely Domain‑aware Prompt‑driven Masked Transformer (DPMFormer). First, we introduce domain‑aware prompt learning to facilitate semantic alignment between visual and textual cues. Second, to capture various domain‑specific properties with a single source dataset, we propose domain‑aware contrastive learning along with a texture perturbation that diversifies the observable domains. Lastly, to establish a framework resilient against diverse environmental changes, we propose domain‑robust consistency learning, which guides the model to minimize discrepancies between predictions on the original and augmented images. Through experiments and analyses, we demonstrate the superiority of the proposed framework, which establishes a new state‑of‑the‑art on various DGSS benchmarks. The code is available at https://github.com/jone1222/DPMFormer.

Abstract:
Reasoning‑centric video object segmentation is an inherently complex task: the query often refers to dynamics, causality, and temporal interactions, rather than static appearances. Yet existing solutions generally collapse these factors into simplified reasoning with latent embeddings, rendering the reasoning chain opaque and essentially intractable. We therefore adopt an explicit decomposition perspective and introduce ReVSeg, which executes reasoning as sequential decisions in the native interface of pretrained vision language models (VLMs). Rather than folding all reasoning into a single‑step prediction, ReVSeg executes three explicit operations (semantics interpretation, temporal evidence selection, and spatial grounding), aligning pretrained capabilities. We further employ reinforcement learning to optimize the multi‑step reasoning chain, enabling the model to self‑refine its decision quality from outcome‑driven signals. Experimental results demonstrate that ReVSeg attains state‑of‑the‑art performance on standard video object segmentation benchmarks and yields interpretable reasoning trajectories. Project page is available at https://clementine24.github.io/ReVSeg/ .

Abstract:
Automatic 3D reconstruction of indoor spaces from 2D floor plans necessitates high‑precision semantic segmentation of structural elements, particularly walls. However, existing methods often struggle with detecting thin structures and maintaining geometric precision. To address this, we introduce MitUNet, a hybrid neural network designed to bridge the gap between global semantic context and fine‑grained structural details. Our architecture combines a Mix‑Transformer encoder with a U‑Net decoder enhanced with spatial and channel attention blocks. Optimized with the Tversky loss function, this approach achieves a balance between precision and recall, ensuring accurate boundary recovery. Experiments on the CubiCasa5k dataset and the regional dataset demonstrate MitUNet's superiority in generating structurally correct masks with high boundary accuracy, outperforming standard models. This tool provides a robust foundation for automated 3D reconstruction pipelines. To ensure reproducibility and facilitate future research, the source code and the regional dataset are publicly available at https://github.com/aliasstudio/mitunet and https://doi.org/10.5281/zenodo.17871079, respectively.
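
For readers unfamiliar with the Tversky loss mentioned above, here is a minimal sketch of its standard form, TI = TP / (TP + alpha * FP + beta * FN) with loss = 1 - TI, computed on soft predictions. The alpha and beta values and the epsilon are illustrative assumptions; the abstract does not state the settings used for MitUNet, and some papers swap which symbol weights false positives.

```python
def tversky_loss(pred, target, alpha=0.5, beta=0.5, eps=1e-7):
    """Tversky loss on flat lists of probabilities (pred) and 0/1 labels (target).

    In this sketch alpha weights false positives and beta weights false
    negatives; alpha = beta = 0.5 reduces the index to the Dice coefficient.
    """
    tp = sum(p * t for p, t in zip(pred, target))
    fp = sum(p * (1 - t) for p, t in zip(pred, target))
    fn = sum((1 - p) * t for p, t in zip(pred, target))
    index = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    return 1.0 - index


# One true positive and two false positives: raising alpha raises the loss.
pred = [1.0, 1.0, 1.0, 0.0]
target = [1, 0, 0, 0]
loss_lenient = tversky_loss(pred, target, alpha=0.3, beta=0.7)
loss_strict = tversky_loss(pred, target, alpha=0.7, beta=0.3)
```

Shifting weight between alpha and beta is what lets the loss trade precision against recall, which matters for thin structures such as walls where missed pixels and spurious pixels have different costs.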

Abstract:
Abscesses in the head and neck represent an acute infectious process that can potentially lead to sepsis or mortality if not diagnosed and managed promptly. Accurate detection and delineation of these lesions on imaging are essential for diagnosis, treatment planning, and surgical intervention. In this study, we introduce AbscessHeNe, a curated and comprehensively annotated dataset comprising 4,926 contrast‑enhanced CT slices with clinically confirmed head and neck abscesses. The dataset is designed to facilitate the development of robust semantic segmentation models that can accurately delineate abscess boundaries and evaluate deep neck space involvement, thereby supporting informed clinical decision‑making. To establish performance baselines, we evaluate several state‑of‑the‑art segmentation architectures, including CNN, Transformer, and Mamba‑based models. The highest‑performing model achieved a Dice Similarity Coefficient of 0.39, Intersection‑over‑Union of 0.27, and Normalized Surface Distance of 0.67, indicating the challenges of this task and the need for further research. Beyond segmentation, AbscessHeNe is structured for future applications in content‑based multimedia indexing and case‑based retrieval. Each CT scan is linked with pixel‑level annotations and clinical metadata, providing a foundation for building intelligent retrieval systems and supporting knowledge‑driven clinical workflows. The dataset will be made publicly available at https://github.com/drthaodao3101/AbscessHeNe.git.
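
As a reference for the overlap metrics reported above, the following sketch computes the Dice Similarity Coefficient and Intersection-over-Union for one pair of binary masks given as flat lists. Returning 1.0 when both masks are empty is a convention chosen here, not something stated in the abstract, and Normalized Surface Distance is omitted because it needs surface extraction and a tolerance parameter.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two flat binary masks of equal length."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    pred_sum = sum(1 for p in pred if p)
    target_sum = sum(1 for t in target if t)
    union = pred_sum + target_sum - inter
    # Convention assumed here: two empty masks count as a perfect match.
    dice = 2 * inter / (pred_sum + target_sum) if (pred_sum + target_sum) else 1.0
    iou = inter / union if union else 1.0
    return dice, iou


dice, iou = dice_and_iou([1, 1, 0, 0], [1, 0, 1, 0])
```

For a single mask pair the two metrics are linked by IoU = Dice / (2 - Dice); averages taken over many cases need not satisfy this identity exactly.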

Abstract:
Video instance segmentation (VIS) for low‑light content remains highly challenging for both humans and machines alike, due to noise, blur and other adverse conditions. The lack of large‑scale annotated datasets and the limitations of current synthetic pipelines, particularly in modeling temporal degradations, further hinder progress. Moreover, existing VIS methods are not robust to the degradations found in low‑light videos and, consequently, perform poorly even after finetuning. In this paper, we introduce ELVIS (Enhance Low‑Light for Video Instance Segmentation), a framework that enables domain adaptation of state‑of‑the‑art VIS models to low‑light scenarios. ELVIS comprises an unsupervised synthetic low‑light video pipeline that models both spatial and temporal degradations, a calibration‑free degradation profile estimation network (VDP‑Net) and an enhancement decoder head that disentangles degradations from content features. ELVIS improves performance by up to +3.7 AP on the synthetic low‑light YouTube‑VIS 2019 dataset and beats two‑stage baselines by at least +2.8 AP on real low‑light videos. Code and dataset are available at: https://joannelin168.github.io/research/ELVIS

Abstract:
Generating high‑fidelity, physically interactive 3D simulated tabletop scenes is essential for embodied AI, especially for robotic manipulation policy learning and data synthesis. However, current text‑ or image‑driven 3D scene generation methods mainly focus on large‑scale scenes, struggling to capture the high‑density layouts and complex spatial relations that characterize tabletop scenes. To address these challenges, we propose TabletopGen, a training‑free, fully automatic framework that generates diverse, instance‑level interactive 3D tabletop scenes. TabletopGen accepts a reference image as input, which can be synthesized by a text‑to‑image model to enhance scene diversity. We then perform instance segmentation and completion on the reference to obtain per‑instance images. Each instance is reconstructed into a 3D model followed by canonical coordinate alignment. The aligned 3D models then undergo pose and scale estimation before being assembled into a collision‑free, simulation‑ready tabletop scene. A key component of our framework is a novel pose and scale alignment approach that decouples the complex spatial reasoning into two stages: a Differentiable Rotation Optimizer for precise rotation recovery and a Top‑view Spatial Alignment mechanism for robust translation and scale estimation, enabling accurate 3D reconstruction from a 2D reference. Extensive experiments and user studies show that TabletopGen achieves state‑of‑the‑art performance, markedly surpassing existing methods in visual fidelity, layout accuracy, and physical plausibility, and is capable of generating realistic tabletop scenes with rich stylistic and spatial diversity. Our code will be publicly available.

Abstract:
Hand‑object interaction (HOI) inherently involves dynamics where human manipulations produce distinct spatio‑temporal effects on objects. However, existing semantic HOI benchmarks focus either on manipulation or on the resulting effects at a coarse level, lacking fine‑grained spatio‑temporal reasoning to capture the underlying dynamics in HOI. We introduce HanDyVQA, a fine‑grained video question‑answering benchmark that comprehensively covers both the manipulation and effect aspects of HOI. HanDyVQA comprises six complementary question types (Action, Process, Objects, Location, State Change, and Object Parts), totalling 11.1K multiple‑choice QA pairs. The collected QA pairs involve recognizing manipulation styles, hand/object motions, and part‑level state changes. HanDyVQA also includes 10.3K segmentation masks for Objects and Object Parts questions, enabling the evaluation of object/part‑level reasoning in video object segmentation. We evaluated recent video foundation models on our benchmark and found that even the best‑performing model, Gemini‑2.5‑Pro, reached only 73% average accuracy, which is far from human performance (97%). Further analysis shows the remaining challenges in spatial relationship, motion, and part‑level geometric understanding. We also found that integrating explicit HOI‑related cues into visual features improves performance, offering insights for developing future models with a deeper understanding of HOI dynamics.

Abstract:
Interactive image segmentation (IIS) plays a critical role in generating precise annotations for remote sensing imagery, where objects often exhibit scale variations, irregular boundaries and complex backgrounds. However, existing IIS methods, primarily designed for natural images, struggle to generalize to remote sensing domains due to limited annotated data and computational overhead. To address these challenges, we propose RS‑ISRefiner, a novel click‑based IIS framework tailored for remote sensing images. The framework employs an adapter‑based tuning strategy that preserves the general representations of Vision Foundation Models while enabling efficient learning of remote sensing‑specific spatial and boundary characteristics. A hybrid attention mechanism integrating convolutional local modeling with Transformer‑based global reasoning enhances robustness against scale diversity and scene complexity. Furthermore, an improved probability map modulation scheme effectively incorporates historical user interactions, yielding more stable iterative refinement and higher boundary accuracy. Comprehensive experiments on six remote sensing datasets, including iSAID, ISPRS Potsdam, SandBar, NWPU, LoveDA Urban and WHUBuilding, demonstrate that RS‑ISRefiner consistently outperforms state‑of‑the‑art IIS methods in terms of segmentation accuracy, efficiency and interaction cost. These results confirm the effectiveness and generalizability of our framework, making it highly suitable for high‑quality instance segmentation in practical remote sensing scenarios. The code is available at https://github.com/wondelyan/VFM‑ISRefiner .

Abstract:
Domain generalization in semantic segmentation faces challenges from domain shifts, particularly under adverse conditions. While diffusion‑based data generation methods show promise, they introduce inherent misalignment between generated images and semantic masks. This paper presents FLEX‑Seg (FLexible Edge eXploitation for Segmentation), a framework that transforms this limitation into an opportunity for robust learning. FLEX‑Seg comprises three key components: (1) Granular Adaptive Prototypes that capture boundary characteristics across multiple scales, (2) Uncertainty Boundary Emphasis that dynamically adjusts learning emphasis based on prediction entropy, and (3) Hardness‑Aware Sampling that progressively focuses on challenging examples. By leveraging inherent misalignment rather than enforcing strict alignment, FLEX‑Seg learns robust representations while capturing rich stylistic variations. Experiments across five real‑world datasets demonstrate consistent improvements over state‑of‑the‑art methods, achieving 2.44% and 2.63% mIoU gains on ACDC and Dark Zurich. Our findings validate that adaptive strategies for handling imperfect synthetic data lead to superior domain generalization. Code is available at https://github.com/VisualScienceLab‑KHU/FLEX‑Seg.
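
The notion of adjusting emphasis by prediction entropy can be illustrated with a small sketch that turns per-pixel class logits into a normalized entropy and then into a loss weight. The weighting rule `1 + strength * entropy` and the function names are hypothetical choices made here; the abstract does not specify how FLEX‑Seg maps entropy to emphasis.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(logits):
    """Shannon entropy of the softmax distribution, scaled to [0, 1].

    Requires at least two classes so that log(num_classes) is non-zero.
    """
    probs = softmax(logits)
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(logits))


def entropy_weight(logits, strength=1.0):
    """Assumed rule: weight 1 for confident pixels, up to 1 + strength otherwise."""
    return 1.0 + strength * normalized_entropy(logits)


uncertain = entropy_weight([0.0, 0.0, 0.0, 0.0])   # uniform prediction
confident = entropy_weight([10.0, 0.0, 0.0, 0.0])  # one dominant class
```

Pixels near object boundaries tend to receive flatter class distributions, so a rule of this kind gives them a larger share of the training signal.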

Abstract:
Accurate nuclear instance segmentation is a pivotal task in computational pathology, supporting data‑driven clinical insights and facilitating downstream translational applications. While large vision foundation models have shown promise for zero‑shot biomedical segmentation, most existing approaches still depend on dense supervision and computationally expensive fine‑tuning. Consequently, training‑free methods present a compelling research direction, yet remain largely unexplored. In this work, we introduce SPROUT, a fully training‑ and annotation‑free prompting framework for nuclear instance segmentation. SPROUT leverages histology‑informed priors to construct slide‑specific reference prototypes that mitigate domain gaps. These prototypes progressively guide feature alignment through a partial optimal transport scheme. The resulting foreground and background features are transformed into positive and negative point prompts, enabling the Segment Anything Model (SAM) to produce precise nuclear delineations without any parameter updates. Extensive experiments across multiple histopathology benchmarks demonstrate that SPROUT achieves competitive performance without supervision or retraining, establishing a novel paradigm for scalable, training‑free nuclear instance segmentation in pathology.

Abstract:
Semantic segmentation is crucial for various biomedical applications, yet its reliance on large annotated datasets presents a bottleneck due to the high cost and specialized expertise required for manual labeling. Active Learning (AL) aims to mitigate this challenge by querying only the most informative samples, thereby reducing annotation effort. However, in the domain of 3D biomedical imaging, there is no consensus on whether AL consistently outperforms Random sampling. Four evaluation pitfalls hinder the current methodological assessment. These are (1) restriction to too few datasets and annotation budgets, (2) using 2D models on 3D images without partial annotations, (3) the Random baseline not being adapted to the task, and (4) measuring annotation cost only in voxels. In this work, we introduce nnActive, an open‑source AL framework that overcomes these pitfalls by (1) conducting a large‑scale study spanning four biomedical imaging datasets and three label regimes, (2) extending nnU‑Net by using partial annotations for training with 3D patch‑based query selection, (3) proposing Foreground Aware Random sampling strategies that tackle the foreground‑background class imbalance of medical images, and (4) proposing the foreground efficiency metric, which captures the low annotation cost of background regions. We reveal the following findings: (A) while all AL methods outperform standard Random sampling, none reliably surpasses an improved Foreground Aware Random sampling; (B) benefits of AL depend on task‑specific parameters; (C) Predictive Entropy is overall the best performing AL method, but likely requires the most annotation effort; (D) AL performance can be improved with more compute‑intensive design choices. As a holistic, open‑source framework, nnActive can serve as a catalyst for research and application of AL in 3D biomedical imaging. Code is at: https://github.com/MIC‑DKFZ/nnActive

Abstract:
Medical image segmentation is fundamental for biomedical discovery. Existing methods lack generalizability and demand extensive, time‑consuming manual annotation for new clinical applications. Here, we propose MedSAM‑3, a text‑promptable medical segmentation model for medical image and video segmentation. By fine‑tuning the Segment Anything Model (SAM) 3 architecture on medical images paired with semantic conceptual labels, our MedSAM‑3 enables medical Promptable Concept Segmentation (PCS), allowing precise targeting of anatomical structures via open‑vocabulary text descriptions rather than solely geometric prompts. We further introduce the MedSAM‑3 Agent, a framework that integrates Multimodal Large Language Models (MLLMs) to perform complex reasoning and iterative refinement in an agent‑in‑the‑loop workflow. Comprehensive experiments across diverse medical imaging modalities, including X‑ray, MRI, Ultrasound, CT, and video, demonstrate that our approach significantly outperforms existing specialist and foundation models. We will release our code and model at https://github.com/Joey‑S‑Liu/MedSAM3.

Abstract:
Out‑of‑Distribution (OOD) detection in semantic segmentation aims to localize anomalous regions at the pixel level, advancing beyond traditional image‑level OOD techniques to better suit real‑world applications such as autonomous driving. Recent literature has successfully explored the adaptation of commonly used image‑level OOD methods, primarily those based on classifier‑derived confidence scores (e.g., energy or entropy), for this pixel‑precise task. However, these methods inherit a set of limitations, including vulnerability to overconfidence. In this work, we introduce SupLID, a novel framework that effectively guides classifier‑derived OOD scores by exploiting the geometrical structure of the underlying semantic space, particularly using Linear Intrinsic Dimensionality (LID). While LID effectively characterizes the local structure of high‑dimensional data by analyzing distance distributions, its direct application at the pixel level remains challenging. To overcome this, SupLID constructs a geometrical coreset that captures the intrinsic structure of the in‑distribution (ID) subspace. It then computes OOD scores at the superpixel level, enabling both efficient real‑time inference and improved spatial smoothness. We demonstrate that geometrical cues derived from SupLID serve as a complementary signal to traditional classifier confidence, enhancing the model's ability to detect diverse OOD scenarios. Designed as a post‑hoc scoring method, SupLID can be seamlessly integrated with any semantic segmentation classifier at deployment time. Our results demonstrate that SupLID significantly enhances existing classifier‑based OOD scores, achieving state‑of‑the‑art performance across key evaluation metrics, including AUR, FPR, and AUP. Code is available at https://github.com/hdnugit/SupLID.

Abstract:
LiDAR scene flow is the task of estimating per‑point 3D motion between consecutive point clouds. Recent methods achieve centimeter‑level accuracy on popular autonomous vehicle (AV) datasets, but are typically only trained and evaluated on a single sensor. In this paper, we aim to learn general motion priors that transfer to diverse and unseen LiDAR sensors. However, prior work in LiDAR semantic segmentation and 3D object detection demonstrates that naively training on multiple datasets yields worse performance than single‑dataset models. Interestingly, we find that this conventional wisdom does not hold for motion estimation, and that state‑of‑the‑art scene flow methods greatly benefit from cross‑dataset training without architectural modification. We posit that low‑level tasks such as motion estimation may be less sensitive to sensor configuration; indeed, our analysis shows that models trained on fast‑moving objects (e.g., from highway datasets) perform well on fast‑moving objects, even across different datasets. Informed by our analysis, we propose UniFlow, a feedforward model that unifies and trains on multiple large‑scale LiDAR scene flow datasets with diverse sensor placements and point cloud densities. Our frustratingly simple solution establishes a new state‑of‑the‑art on Waymo and nuScenes, improving over prior work by 5.1% and 35.2% respectively. Moreover, UniFlow achieves state‑of‑the‑art accuracy on unseen datasets like TruckScenes and AEVAScenes, outperforming prior dataset‑specific models by 30.1% and 22.5% respectively.

Abstract:
Few‑Shot Semantic Segmentation (FSS) models achieve strong performance in segmenting novel classes with minimal labeled examples, yet their decision‑making processes remain largely opaque. While explainable AI has advanced significantly in standard computer vision tasks, interpretability in FSS remains virtually unexplored despite its critical importance for understanding model behavior and guiding support set selection in data‑scarce scenarios. This paper introduces the first dedicated method for interpreting matching‑based FSS models by leveraging their inherent structural properties. Our Affinity Explainer approach extracts attribution maps that highlight which pixels in support images contribute most to query segmentation predictions, using matching scores computed between support and query features at multiple feature levels. We extend standard interpretability evaluation metrics to the FSS domain and propose additional metrics to better capture the practical utility of explanations in few‑shot scenarios. Comprehensive experiments on FSS benchmark datasets, using different models, demonstrate that our Affinity Explainer significantly outperforms adapted standard attribution methods. Qualitative analysis reveals that our explanations provide structured, coherent attention patterns that align with model architectures and enable effective model diagnosis. This work establishes the foundation for interpretable FSS research, enabling better model understanding and diagnostics for more reliable few‑shot segmentation systems. The source code is publicly available at https://github.com/pasqualedem/AffinityExplainer.

Abstract:
Due to the high cost of annotation or the rarity of some diseases, medical image segmentation is often limited by data scarcity and the resulting overfitting problem. Self‑supervised learning and semi‑supervised learning can mitigate the data scarcity challenge to some extent. However, both of these paradigms are complex and require either hand‑crafted pretexts or well‑defined pseudo‑labels. In contrast, data augmentation represents a relatively simple and straightforward approach to addressing data scarcity issues. It has led to significant improvements in image recognition tasks. However, the effectiveness of local image editing augmentation techniques in the context of segmentation has been less explored. We propose HSMix, a novel approach to local image editing data augmentation involving hard and soft mixing for medical semantic segmentation. In our approach, a hard‑augmented image is created by combining homogeneous regions (superpixels) from two source images. A soft mixing method further adjusts the brightness of these composed regions with brightness mixing based on locally aggregated pixel‑wise saliency coefficients. The ground‑truth segmentation masks of the two source images undergo the same mixing operations to generate the associated masks for the augmented images. Our method fully exploits both the prior contour and saliency information, thus preserving local semantic information in the augmented images while enriching the augmentation space with more diversity. Our method is a plug‑and‑play solution that is model agnostic and applicable to a range of medical imaging modalities. Extensive experimental evidence has demonstrated its effectiveness in a variety of medical segmentation tasks. The source code is available at https://github.com/DanielaPlusPlus/HSMix.
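
The hard-mixing step described above, where superpixel regions from two sources are combined and the masks follow the same choice, can be sketched as below. This is a minimal illustration under stated assumptions, not the HSMix implementation: superpixels are assumed to be precomputed, images are flattened to per-pixel lists, the saliency-based soft brightness mixing is omitted, and `hard_mix` and `take_from_b` are hypothetical names.

```python
def hard_mix(image_a, image_b, mask_a, mask_b, superpixels, take_from_b):
    """Compose one augmented sample from two sources, region by region.

    All inputs are flat per-pixel lists of equal length. `superpixels` holds
    the region id of each pixel and `take_from_b` is the set of region ids
    copied from source B. The same choice is applied to image and mask.
    """
    mixed_image, mixed_mask = [], []
    for pa, pb, ma, mb, region in zip(image_a, image_b, mask_a, mask_b, superpixels):
        use_b = region in take_from_b
        mixed_image.append(pb if use_b else pa)
        mixed_mask.append(mb if use_b else ma)
    return mixed_image, mixed_mask
```

Because the label of each pixel travels with the pixel, the augmented mask stays consistent with the augmented image whichever regions are swapped.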

Abstract:
Semantic segmentation networks trained under full supervision for one type of lidar fail to generalize to unseen lidars without intervention. To reduce the performance gap under domain shifts, a recent trend is to leverage vision foundation models (VFMs) providing robust features across domains. In this work, we conduct an exhaustive study to identify recipes for exploiting VFMs in unsupervised domain adaptation for semantic segmentation of lidar point clouds. Building upon unsupervised image‑to‑lidar knowledge distillation, our study reveals that: (1) the architecture of the lidar backbone is key to maximize the generalization performance on a target domain; (2) it is possible to pretrain a single backbone once and for all, and use it to address many domain shifts; (3) best results are obtained by keeping the pretrained backbone frozen and training an MLP head for semantic segmentation. The resulting pipeline achieves state‑of‑the‑art results in four widely‑recognized and challenging settings. The code will be available at: https://github.com/valeoai/muddos.

Abstract:
Surgical video segmentation is crucial for computer‑assisted surgery, enabling precise localization and tracking of instruments and tissues. Interactive Video Object Segmentation (iVOS) models such as Segment Anything Model 2 (SAM2) provide prompt‑based flexibility beyond methods with predefined categories, but face challenges in surgical scenarios due to the domain gap and limited long‑term tracking. To address these limitations, we construct SA‑SV, the largest surgical iVOS benchmark with instance‑level spatio‑temporal annotations (masklets) spanning eight procedure types (61k frames, 1.6k masklets), enabling comprehensive development and evaluation for long‑term tracking and zero‑shot generalization. Building on SA‑SV, we propose SAM2S, a foundation model enhancing SAM2 for Surgical iVOS through: (1) DiveMem, a trainable diverse memory mechanism for robust long‑term tracking; (2) temporal semantic learning for instrument understanding; and (3) ambiguity‑resilient learning to mitigate annotation inconsistencies across multi‑source datasets. Extensive experiments demonstrate that fine‑tuning on SA‑SV enables substantial performance gains, with SAM2 improving by 12.99 average $\mathcal{J}\&\mathcal{F}$ over vanilla SAM2. SAM2S further advances performance to 80.42 average $\mathcal{J}\&\mathcal{F}$, surpassing vanilla and fine‑tuned SAM2 by 17.10 and 4.11 points respectively, while maintaining 68 FPS real‑time inference and strong zero‑shot generalization. Code and dataset will be released at https://jinlab‑imvr.github.io/SAM2S.

Abstract:
We present Upsample Anything, a lightweight test‑time optimization (TTO) framework that restores low‑resolution features to high‑resolution, pixel‑wise outputs without any training. Although Vision Foundation Models demonstrate strong generalization across diverse downstream tasks, their representations are typically downsampled by 14x/16x (e.g., ViT), which limits their direct use in pixel‑level applications. Existing feature upsampling approaches depend on dataset‑specific retraining or heavy implicit optimization, restricting scalability and generalization. Upsample Anything addresses these issues through a simple per‑image optimization that learns an anisotropic Gaussian kernel combining spatial and range cues, effectively bridging Gaussian Splatting and Joint Bilateral Upsampling. The learned kernel acts as a universal, edge‑aware operator that transfers seamlessly across architectures and modalities, enabling precise high‑resolution reconstruction of features, depth, or probability maps. It runs in only $\approx 0.419\,\text{s}$ per 224x224 image and achieves state‑of‑the‑art performance on semantic segmentation, depth estimation, and both depth and probability map upsampling. Project page: https://seominseok0429.github.io/Upsample‑Anything/
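
For readers unfamiliar with Joint Bilateral Upsampling, the classical fixed-kernel version can be sketched in one dimension as follows. This is not the Upsample Anything method: the kernel here is isotropic and hand-set rather than learned per image, and the function name, the `sigma_s`/`sigma_r` defaults, and the 1D layout are assumptions chosen for brevity.

```python
import math


def joint_bilateral_upsample_1d(low, guide, factor, sigma_s=1.0, sigma_r=0.1):
    """Upsample `low` to len(guide) using a high-resolution guide signal.

    Each output sample is a weighted mean of the low-resolution values. The
    weight is the product of a spatial Gaussian (distance measured in
    low-resolution pixels) and a range Gaussian (difference in guide value).
    """
    # Centre of each low-resolution sample on the high-resolution grid.
    centres = [i * factor + (factor - 1) / 2 for i in range(len(low))]
    # Guide value associated with each low-resolution sample.
    guide_low = [guide[int(c)] for c in centres]
    out = []
    for x, g in enumerate(guide):
        num = den = 0.0
        for c, gl, value in zip(centres, guide_low, low):
            spatial = math.exp(-(((x - c) / factor) ** 2) / (2 * sigma_s ** 2))
            rng = math.exp(-((g - gl) ** 2) / (2 * sigma_r ** 2))
            weight = spatial * rng
            num += weight * value
            den += weight
        out.append(num / den)
    return out
```

With a step-shaped guide, the range term suppresses contributions from across the edge, so the upsampled signal keeps the step sharp instead of blurring it as plain interpolation would.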

Abstract:
Traditional video reasoning segmentation methods rely on supervised fine‑tuning, which limits generalization to out‑of‑distribution scenarios and lacks explicit reasoning. To address this, we propose VideoSeg‑R1, the first framework to introduce reinforcement learning into video reasoning segmentation. It adopts a decoupled architecture that formulates the task as joint referring image segmentation and video mask propagation. It comprises three stages: (1) A hierarchical text‑guided frame sampler to emulate human attention; (2) A reasoning model that produces spatial cues along with explicit reasoning chains; and (3) A segmentation‑propagation stage using SAM2 and XMem. A task difficulty‑aware mechanism adaptively controls reasoning length for better efficiency and accuracy. Extensive evaluations on multiple benchmarks demonstrate that VideoSeg‑R1 achieves state‑of‑the‑art performance in complex video reasoning and segmentation tasks. The code will be publicly available at https://github.com/euyis1019/VideoSeg‑R1.

Abstract:
Reliable semantic segmentation is essential for clinical decision making, yet deep models rarely provide explicit statistical guarantees on their errors. We introduce a simple post‑hoc framework that constructs confidence masks with distribution‑free, image‑level control of false‑positive predictions. Given any pretrained segmentation model, we define a nested family of shrunken masks obtained either by increasing the score threshold or by applying morphological erosion. A labeled calibration set is used to select a single shrink parameter via conformal prediction, ensuring that, for new images that are exchangeable with the calibration data, the proportion of false positives retained in the confidence mask stays below a user‑specified tolerance with high probability. The method is model‑agnostic, requires no retraining, and provides finite‑sample guarantees regardless of the underlying predictor. Experiments on a polyp‑segmentation benchmark demonstrate target‑level empirical validity. Our framework enables practical, risk‑aware segmentation in settings where over‑segmentation can have clinical consequences. Code at https://github.com/deel‑ai‑papers/conseco.
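
One way to picture the calibration step is the split-conformal sketch below, which picks a single score threshold from a labeled calibration set. It is an illustration under assumptions, not the authors' code: the threshold grid, the flat per-pixel lists, and all function names are hypothetical, and only the score-threshold family (not morphological erosion) is shown.

```python
import math


def fp_proportion(scores, truth, lam):
    """Share of pixels kept at threshold `lam` that are false positives."""
    kept = [t for s, t in zip(scores, truth) if s >= lam]
    if not kept:
        return 0.0  # an empty mask contains no false positives
    return sum(1 for t in kept if not t) / len(kept)


def minimal_safe_threshold(scores, truth, tol, grid):
    """Smallest grid value from which every larger grid value meets `tol`."""
    safe = math.inf  # an infinite threshold keeps no pixel at all
    for lam in sorted(grid, reverse=True):
        if fp_proportion(scores, truth, lam) > tol:
            break
        safe = lam
    return safe


def conformal_threshold(calibration, tol, alpha, grid):
    """Conformal quantile of the per-image safe thresholds.

    `calibration` is a list of (scores, truth) pairs. For an exchangeable new
    image, the returned threshold meets the tolerance with probability at
    least 1 - alpha.
    """
    per_image = sorted(minimal_safe_threshold(s, t, tol, grid) for s, t in calibration)
    n = len(per_image)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration images for this alpha
    return per_image[k - 1]
```

The guarantee follows from the nesting: any threshold at or above an image's own safe threshold also meets the tolerance for that image, so it is enough to bound the new image's safe threshold by a calibration quantile.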

Abstract:
As CLIP's global alignment limits its ability to capture fine‑grained details, recent efforts have focused on enhancing its region‑text alignment. However, current remote sensing (RS)‑specific CLIP variants still inherit this limited spatial awareness. We identify two key limitations behind this: (1) current RS image‑text datasets generate global captions from object‑level labels, leaving the original object‑level supervision underutilized; (2) despite the success of region‑text alignment methods in general domain, their direct application to RS data often leads to performance degradation. To address these, we construct the first multi‑granularity RS image‑text dataset, MGRS‑200k, featuring rich object‑level textual supervision for RS region‑category alignment. We further investigate existing fine‑grained CLIP tuning strategies and find that current explicit region‑text alignment methods, whether in a direct or indirect way, underperform due to severe degradation of CLIP's semantic coherence. Building on these, we propose FarSLIP, a Fine‑grained Aligned RS Language‑Image Pretraining framework. Rather than the commonly used patch‑to‑CLS self‑distillation, FarSLIP employs patch‑to‑patch distillation to align local and global visual cues, which improves feature discriminability while preserving semantic coherence. Additionally, to effectively utilize region‑text supervision, it employs simple CLS token‑based region‑category alignment rather than explicit patch‑level alignment, further enhancing spatial awareness. FarSLIP features improved fine‑grained vision‑language alignment in RS domain and sets a new state of the art not only on RS open‑vocabulary semantic segmentation, but also on image‑level tasks such as zero‑shot classification and image‑text retrieval. Our dataset, code, and models are available at https://github.com/NJU‑LHRS/FarSLIP.

Abstract:
Accurate delineation of agricultural field boundaries from satellite imagery is essential for land management and crop monitoring, yet existing methods often produce incomplete boundaries, merge adjacent fields, and struggle to scale. We present the Delineate Anything Flow (DelAnyFlow) methodology, a resolution‑agnostic approach for large‑scale field boundary mapping. DelAnyFlow combines the DelAny instance segmentation model, based on a YOLOv11 backbone and trained on the large‑scale Field Boundary Instance Segmentation‑22M (FBIS 22M) dataset, with structured post‑processing, merging, vectorization, and simplification to generate vector boundaries. FBIS 22M, the largest dataset of its kind, contains 672,909 multi‑resolution image patches (0.25‑10 m) and 22.9 million validated field instances. The DelAny model delivers state‑of‑the‑art accuracy with over 100% higher mAP and 400x faster inference than SAM2. DelAny demonstrates strong zero‑shot generalization and supports national‑scale applications: using Sentinel 2 data for 2024, DelAnyFlow generated a complete field boundary layer for Ukraine (603,000 km²) in under six hours on a single workstation. DelAnyFlow outputs significantly improve boundary completeness relative to operational products from Sinergise Solutions and NASA Harvest, particularly in smallholder and fragmented systems (0.25‑1 ha). For Ukraine, DelAnyFlow delineated 3.75M fields at 5 m and 5.15M at 2.5 m, compared to 2.66M detected by Sinergise Solutions and 1.69M by NASA Harvest. This work delivers a scalable, cost‑effective methodology for field delineation in regions lacking digital cadastral data. A project landing page with links to model weights, code, national‑scale vector outputs, and dataset is available at https://lavreniuk.github.io/Delineate‑Anything/.

Abstract:
Indoor semantic segmentation is fundamental to computer vision and robotics, supporting applications such as autonomous navigation, augmented reality, and smart environments. Although RGB‑D fusion leverages complementary appearance and geometric cues, existing methods often depend on computationally intensive cross‑attention mechanisms and insufficiently model intra‑ and inter‑modal feature relationships, resulting in imprecise feature alignment and limited discriminative representation. To address these challenges, we propose DiffPixelFormer, a differential pixel‑aware Transformer for RGB‑D indoor scene segmentation that simultaneously enhances intra‑modal representations and models inter‑modal interactions. At its core, the Intra‑Inter Modal Interaction Block (IIMIB) captures intra‑modal long‑range dependencies via self‑attention and models inter‑modal interactions with the Differential‑Shared Inter‑Modal (DSIM) module to disentangle modality‑specific and shared cues, enabling fine‑grained, pixel‑level cross‑modal alignment. Furthermore, a dynamic fusion strategy balances modality contributions and fully exploits RGB‑D information according to scene characteristics. Extensive experiments on the SUN RGB‑D and NYUDv2 benchmarks demonstrate that DiffPixelFormer‑L achieves mIoU scores of 54.28% and 59.95%, outperforming DFormer‑L by 1.78% and 2.75%, respectively. Code is available at https://github.com/gongyan1/DiffPixelFormer.

Abstract:
Weakly supervised 3D instance segmentation is essential for 3D scene understanding, especially given the growing scale of data and the high annotation costs associated with fully supervised approaches. Existing methods primarily rely on two forms of weak supervision: one‑thing‑one‑click annotations and bounding box annotations, both of which aim to reduce labeling efforts. However, these approaches still encounter limitations, including labor‑intensive annotation processes, high complexity, and reliance on expert annotators. To address these challenges, we propose DBGroup, a two‑stage weakly supervised 3D instance segmentation framework that leverages scene‑level annotations as a more efficient and scalable alternative. In the first stage, we introduce a Dual‑Branch Point Grouping module to generate pseudo labels guided by semantic and mask cues extracted from multi‑view images. To further improve label quality, we develop two refinement strategies: Granularity‑Aware Instance Merging and Semantic Selection and Propagation. The second stage involves multi‑round self‑training on an end‑to‑end instance segmentation network using the refined pseudo‑labels. Additionally, we introduce an Instance Mask Filter strategy to address inconsistencies within the pseudo labels. Extensive experiments demonstrate that DBGroup achieves competitive performance compared to sparse‑point‑level supervised 3D instance segmentation methods, while surpassing state‑of‑the‑art scene‑level supervised 3D semantic segmentation approaches. Code is available at https://github.com/liuxuexun/DBGroup.

Abstract:
Soiling detection for automotive cameras is a crucial part of advanced driver assistance systems to make them more robust to external conditions like weather, dust, etc. In this paper, we regard soiling detection as a semantic segmentation problem. We provide a comprehensive comparison of popular segmentation methods and show their superior performance compared to tile‑level classification approaches. Moreover, we present an extensive analysis of the Woodscape dataset showing that the original dataset contains data leakage and imprecise annotations. To address these problems, we create a new data subset, which, despite being much smaller, provides enough information for the segmentation method to reach comparable results in a much shorter time. All our code and dataset splits are available at https://github.com/filipberanek/woodscape_revision.

Abstract:
Underwater instance segmentation (UIS), integrating pixel‑level understanding and instance‑level discrimination, is a pivotal technology in marine resource exploration and ecological protection. In recent years, large‑scale pretrained visual foundation models, exemplified by DINO, have advanced rapidly and demonstrated remarkable performance on complex downstream tasks. In this paper, we demonstrate that DINO can serve as an effective feature learner for UIS, and we introduce DiveSeg, a novel framework built upon two insightful components: (1) the AquaStyle Aligner, designed to embed underwater color style features into the DINO fine‑tuning process, facilitating better adaptation to the underwater domain; and (2) the ObjectPrior Prompter, which incorporates binary segmentation‑based prompts to deliver object‑level priors, providing essential guidance for the instance segmentation task, which requires both object‑ and instance‑level reasoning. We conduct thorough experiments on the popular UIIS and USIS10K datasets, and the results show that DiveSeg achieves state‑of‑the‑art performance. Code: https://github.com/ettof/Diveseg.

Abstract:
Semantic segmentation has achieved great success in ideal conditions. However, when facing extreme conditions (e.g., insufficient light, fierce camera motion), most existing methods suffer from significant information loss of RGB, severely damaging segmentation results. Several studies exploit the high‑speed and high‑dynamic event modality as a complement, but event and RGB are naturally heterogeneous, which leads to feature‑level mismatch and inferior optimization of existing multi‑modality methods. Unlike these studies, we delve into the edge secret of both modalities for resilient fusion and propose a novel Edge‑awareness Semantic Concordance framework to unify the multi‑modality heterogeneous features with latent edge cues. In this framework, we first propose Edge‑awareness Latent Re‑coding, which obtains uncertainty indicators while realigning event‑RGB features into a unified semantic space guided by the re‑coded distribution, and transfers event‑RGB distributions into re‑coded features by utilizing a pre‑established edge dictionary as clues. We then propose Re‑coded Consolidation and Uncertainty Optimization, which utilize re‑coded edge features and uncertainty indicators to solve the heterogeneous event‑RGB fusion issues under extreme conditions. We establish two synthetic and one real‑world event‑RGB semantic segmentation datasets for extreme scenario comparisons. Experimental results show that our method outperforms the state‑of‑the‑art by 2.55% mIoU on our proposed DERS‑XS, and possesses superior resilience under spatial occlusion. Our code and datasets are publicly available at https://github.com/iCVTEAM/ESC.

Abstract:
Dense and versatile image representations underpin the success of virtually all computer vision applications. However, state‑of‑the‑art networks, such as transformers, produce low‑resolution feature grids, which are suboptimal for dense prediction tasks. To address this limitation, we present FlowFeat, a high‑resolution and multi‑task feature representation. The key ingredient behind FlowFeat is a novel distillation technique that embeds a distribution of plausible apparent motions, or motion profiles. By leveraging optical flow networks and diverse video data, we develop an effective self‑supervised training framework that statistically approximates the apparent motion. With its remarkable level of spatial detail, FlowFeat encodes a compelling degree of geometric and semantic cues while exhibiting high temporal consistency. Empirically, FlowFeat significantly enhances the representational power of five state‑of‑the‑art encoders and alternative upsampling strategies across three dense tasks: video object segmentation, monocular depth estimation and semantic segmentation. Training FlowFeat is computationally inexpensive and robust to inaccurate flow estimation, remaining highly effective even when using unsupervised flow networks. Our work takes a step forward towards reliable and versatile dense image representations.

Abstract:
3D semantic scene understanding remains a long‑standing challenge in the 3D computer vision community. One of the key issues pertains to limited real‑world annotated data to facilitate generalizable models. The common practice to tackle this issue is to simulate new data. Although synthetic datasets offer scalability and perfect labels, their designer‑crafted scenes fail to capture real‑world complexity and sensor noise, resulting in a synthetic‑to‑real domain gap. Moreover, no benchmark provides synchronized real and simulated point clouds for segmentation‑oriented domain shift analysis. We introduce TrueCity, the first urban semantic segmentation benchmark with cm‑accurate annotated real‑world point clouds, semantic 3D city models, and annotated simulated point clouds representing the same city. TrueCity proposes segmentation classes aligned with international 3D city modeling standards, enabling consistent evaluation of synthetic‑to‑real gap. Our extensive experiments on common baselines quantify domain shift and highlight strategies for exploiting synthetic data to enhance real‑world 3D scene understanding. We are convinced that the TrueCity dataset will foster further development of sim‑to‑real gap quantification and enable generalizable data‑driven models. The data, code, and 3D models are available online: https://tum‑gis.github.io/TrueCity/

Abstract:
Event cameras offer unique advantages for vision tasks in challenging environments, yet processing asynchronous event streams remains an open challenge. While existing methods rely on specialized architectures or resource‑intensive training, the potential of leveraging modern Visual Foundation Models (VFMs) pretrained on image data remains under‑explored for event‑based vision. To address this, we propose Temporal‑Guided VFM (TGVFM), a novel framework that integrates VFMs with our temporal context fusion block seamlessly to bridge this gap. Our temporal block introduces three key components: (1) Long‑Range Temporal Attention to model global temporal dependencies, (2) Dual Spatiotemporal Attention for multi‑scale frame correlation, and (3) Deep Feature Guidance Mechanism to fuse semantic‑temporal features. By retraining event‑to‑video models on real‑world data and leveraging transformer‑based VFMs, TGVFM preserves spatiotemporal dynamics while harnessing pretrained representations. Experiments demonstrate SoTA performance across semantic segmentation, depth estimation, and object detection, with improvements of 16%, 21%, and 16% over existing methods, respectively. Overall, this work unlocks the cross‑modality potential of image‑based VFMs for event‑based vision with temporal reasoning. Code is available at https://github.com/XiaRho/TGVFM.

Abstract:
Biomedical image segmentation is critical for precise structure delineation and downstream analysis. Traditional methods often struggle with noisy data, while deep learning models such as U‑Net have set new benchmarks in segmentation performance. nnU‑Net further automates model configuration, making it adaptable across datasets without extensive tuning. However, it requires a substantial amount of annotated data for cross‑validation, posing a challenge when only raw images but no labels are available. Large foundation models offer zero‑shot generalizability, but may underperform on specific datasets with unique characteristics, limiting their direct use for analysis. This work addresses these bottlenecks by proposing a data‑centric AI workflow that leverages active learning and pseudo‑labeling to combine the strengths of traditional neural networks and large foundation models while minimizing human intervention. The pipeline starts by generating pseudo‑labels from a foundation model, which are then used for nnU‑Net's self‑configuration. Subsequently, a representative core‑set is selected for minimal manual annotation, enabling effective fine‑tuning of the nnU‑Net model. This approach significantly reduces the need for manual annotations while maintaining competitive performance, providing an accessible solution for biomedical researchers to apply state‑of‑the‑art AI techniques in their segmentation tasks. The code is available at https://github.com/MMV‑Lab/AL_BioMed_img_seg.

Abstract:
We present MLPerf Automotive, the first standardized public benchmark for evaluating Machine Learning systems that are deployed for AI acceleration in automotive systems. Developed through a collaborative partnership between MLCommons and the Autonomous Vehicle Computing Consortium, this benchmark addresses the need for standardized performance evaluation methodologies in automotive machine learning systems. Existing benchmark suites cannot be utilized for these systems since automotive workloads have unique constraints including safety and real‑time processing that distinguish them from the domains that previously introduced benchmarks target. Our benchmarking framework provides latency and accuracy metrics along with evaluation protocols that enable consistent and reproducible performance comparisons across different hardware platforms and software implementations. The first iteration of the benchmark consists of automotive perception tasks in 2D object detection, 2D semantic segmentation, and 3D object detection. We describe the methodology behind the benchmark design including the task selection, reference models, and submission rules. We also discuss the first round of benchmark submissions and the challenges involved in acquiring the datasets and the engineering efforts to develop the reference implementations. Our benchmark code is available at https://github.com/mlcommons/mlperf_automotive.

Abstract:
Photo‑z algorithms that utilize SED template fitting have matured, and are widely adopted for use on high‑redshift near‑infrared data that provides a unique window into the early universe. Alternative photo‑z methods have been developed, largely within the context of low‑redshift optical surveys. Machine learning based approaches have gained footing in this regime, including those that utilize raw pixel information instead of aperture photometry. However, the efficacy of image‑based algorithms on high‑redshift, near‑infrared data remains underexplored. Here, we test the performance of Detection, Instance Segmentation and Classification with Deep Learning (DeepDISC) on photometric redshift estimation with NIRCam images from the JWST Advanced Deep Extragalactic Survey (JADES) program. DeepDISC is designed to produce probabilistic photometric redshift estimates directly from images, after detecting and deblending sources in a scene. Using NIRCam‑only images and a compiled catalog of spectroscopic redshifts, we show that DeepDISC produces reliable photo‑zs and uncertainties comparable to those estimated from template fitting using HST+JWST filters; DeepDISC even outperforms template fitting (lower scatter/fewer outliers) when the input photometric filters are matched. Compared with template fitting, DeepDISC does not require measured photometry from images, and can produce a catalog of 94000 photo‑zs in ~4 minutes on a single NVIDIA A40 GPU. While current spectroscopic training samples are small and incomplete in color‑magnitude space, this work demonstrates the potential of DeepDISC for increasingly larger image volumes and spectroscopic samples from ongoing and future programs. We discuss the impact of the training data on applications to broader samples and produce a catalog of photo‑zs for all JADES DR2 photometric sources in the GOODS‑S field, with quality flags indicating caveats.

Abstract:
Trees Outside Forests (TOF) play an important role in agricultural landscapes by supporting biodiversity, sequestering carbon, and regulating microclimates. Yet, most studies have treated TOF as a single class or relied on rigid rule‑based thresholds, limiting ecological interpretation and adaptability across regions. To address this, we evaluate deep learning for TOF classification using a newly generated dataset and high‑resolution aerial imagery from four agricultural landscapes in Germany. Specifically, we compare convolutional neural networks (CNNs), vision transformers, and hybrid CNN‑transformer models across six semantic segmentation architectures (ABCNet, LSKNet, FT‑UNetFormer, DC‑Swin, BANet, and U‑Net) to map four categories of woody vegetation: Forest, Patch, Linear, and Tree, derived from previous studies and governmental products. Overall, the models achieved good classification accuracy across the four landscapes, with the FT‑UNetFormer performing best (mean Intersection‑over‑Union 0.74; mean F1 score 0.84), underscoring the importance of spatial context understanding in TOF mapping and classification. The models perform well on the Forest and Linear classes but struggle with complex structures of high edge density, notably the Patch and Tree classes. Our generalization experiments highlight the need for regionally diverse training data to ensure reliable large‑scale mapping. The dataset and code are openly available at https://github.com/Moerizzy/TOFMapper

Abstract:
Semantic segmentation has emerged as a fundamental problem in computer vision, gaining particular importance in real‑time applications such as autonomous driving. The main challenge is achieving high accuracy while operating under computational and hardware constraints. In this research, we present an FPGA‑based implementation of real‑time semantic segmentation leveraging the lightweight LMIINet architecture and the Coarse‑Grained Reconfigurable Array for Machine Learning (CGRA4ML) hardware framework. The model was trained using Quantization‑Aware Training (QAT) with 8‑bit precision on the Cityscapes dataset, reducing memory footprint by a factor of four while enabling efficient fixed‑point computations. Necessary modifications were applied to adapt the model to CGRA4ML constraints, including simplifying skip connections, employing hardware‑friendly operations such as depthwise‑separable and 1×1 convolutions, and redesigning parts of the Flatten Transformer. Our implementation achieves approximately 90% pixel accuracy and 45% mean Intersection‑over‑Union (mIoU), operating in real‑time at 20 frames per second (FPS) with 50.1 ms latency on the ZCU104 FPGA board. The results demonstrate that CGRA4ML, with its flexibility in mapping modern layers and its off‑chip memory utilization for skip connections, provides a path for implementing advanced semantic segmentation networks on FPGAs for real‑time applications that outperform traditional GPU solutions in power efficiency while maintaining competitive accuracy. The code for this project is publicly available at https://github.com/STAmirr/cgra4ml_semantic_segmentation
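
The factor-of-four memory reduction follows from storing each weight in 8 bits instead of 32. The sketch below shows that arithmetic together with a plain symmetric post-hoc int8 quantizer. It is only an illustration: it is not Quantization-Aware Training and not the CGRA4ML toolchain, and the helper names are assumptions.

```python
def memory_bytes(num_params, bits):
    """Storage needed for `num_params` values at the given bit width."""
    return num_params * bits // 8


def quantize_int8(values):
    """Symmetric quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integers back to approximate floating-point values."""
    return [q * scale for q in quantized]
```

For weights within the clipping range, the round-trip error per value is at most half the scale. QAT exposes the network to this kind of rounding during training so that accuracy holds up after conversion to fixed point.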

Abstract:
Semantic segmentation demands dense pixel‑level annotations, which can be prohibitively expensive ‑ especially under extremely constrained labeling budgets. In this paper, we address the problem of low‑budget active learning for semantic segmentation by proposing a novel two‑stage selection pipeline. Our approach leverages a pre‑trained diffusion model to extract rich multi‑scale features that capture both global structure and fine details. In the first stage, we perform a hierarchical, representation‑based candidate selection by first choosing a small subset of representative pixels per image using MaxHerding, and then refining these into a diverse global pool. In the second stage, we compute an entropy‑augmented disagreement score (eDALD) over noisy multi‑scale diffusion features to capture both epistemic uncertainty and prediction confidence, selecting the most informative pixels for annotation. This decoupling of diversity and uncertainty lets us achieve high segmentation accuracy with only a tiny fraction of labeled pixels. Extensive experiments on four benchmarks (CamVid, ADE‑Bed, Cityscapes, and Pascal‑Context) demonstrate that our method significantly outperforms existing baselines under extreme pixel‑budget regimes. Our code is available at https://github.com/jn‑kim/two‑stage‑edald.
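
The second stage combines a disagreement term with an entropy term. The abstract does not give the exact eDALD formula, so the sketch below substitutes a BALD-style mutual information (entropy of the mean prediction minus the mean per-pass entropy) plus a weighted predictive entropy. The function names, `entropy_weight`, and the toy probabilities are all assumptions.

```python
import math


def entropy(p):
    """Shannon entropy (nats) of a probability vector."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def edald_like_score(prob_samples, entropy_weight=1.0):
    """Score one pixel from several class-probability vectors (one per noisy pass)."""
    k = len(prob_samples)
    n = len(prob_samples[0])
    mean_p = [sum(s[c] for s in prob_samples) / k for c in range(n)]
    total = entropy(mean_p)                                   # predictive entropy
    expected = sum(entropy(s) for s in prob_samples) / k      # average per-pass entropy
    disagreement = total - expected                           # BALD-style mutual information
    return disagreement + entropy_weight * total


def select_top_pixels(scores, budget):
    """Indices of the `budget` highest-scoring candidate pixels."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:budget]


candidates = [
    [[1.0, 0.0], [1.0, 0.0]],   # passes agree and are confident
    [[1.0, 0.0], [0.0, 1.0]],   # passes disagree
    [[0.5, 0.5], [0.5, 0.5]],   # passes agree but are unsure
]
scores = [edald_like_score(c) for c in candidates]
print(select_top_pixels(scores, 2))
```

Under this stand-in score, a pixel where the passes contradict each other ranks above one where they are consistently unsure, and both rank above a confidently predicted pixel.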

Abstract:
Spiking Neural Networks (SNNs) demonstrate significant potential for energy‑efficient neuromorphic computing through an event‑driven paradigm. While training methods and computational models have greatly advanced, SNNs struggle to achieve competitive performance in visual long‑sequence modeling tasks. In artificial neural networks, the effective receptive field (ERF) serves as a valuable tool for analyzing feature extraction capabilities in visual long‑sequence modeling. Inspired by this, we introduce the Spatio‑Temporal Effective Receptive Field (ST‑ERF) to analyze the ERF distributions across various Transformer‑based SNNs. Based on the proposed ST‑ERF, we reveal that these models struggle to establish a robust global ST‑ERF, thereby limiting their visual feature modeling capabilities. To overcome this issue, we propose two novel channel‑mixer architectures: the multi‑layer‑perceptron‑based mixer (MLPixer) and the splash‑and‑reconstruct block (SRB). These architectures enhance global spatial ERF through all timesteps in early network stages of Transformer‑based SNNs, improving performance on challenging visual long‑sequence modeling tasks. Extensive experiments conducted on the Meta‑SDT variants and across object detection and semantic segmentation tasks further validate the effectiveness of our proposed method. Beyond these specific applications, we believe the proposed ST‑ERF framework can provide valuable insights for designing and optimizing SNN architectures across a broader range of tasks. The code is available at https://github.com/EricZhang1412/Spatial‑temporal‑ERF.

Abstract:
Large‑scale foundation models provide powerful feature representations for downstream object segmentation tasks. However, when they are adapted to specific tasks through full‑parameter fine‑tuning, the enormous number of parameters being updated often results in significant computational overhead, creating a bottleneck in training efficiency. Although existing methods attempt to fine‑tune frozen models by directly embedding trainable prompts, these prompts lack inherent semantic priors, limiting the adaptability of large‑scale models. In this paper, we propose a novel dynamic priors‑based fine‑tuning paradigm with fewer trainable parameters, dubbed Controllable‑LPMoE, which adaptively modulates frozen foundation models by dynamically controlling local priors to enhance fine‑grained perception for specific segmentation tasks. More specifically, we construct a lightweight dynamic mixed local priors extractor that captures diverse local priors from input images through heterogeneous convolutions while employing a gating network to dynamically output expert priors required for the subsequent fine‑tuning. Furthermore, we design a bi‑directional interaction adapter that employs cosine‑aligned deformable attention and channel‑oriented adaptive scale enhancement to interact and restructure between frozen and trainable features, achieving efficient fine‑tuning. Extensive experiments validate the superiority of our Controllable‑LPMoE approach (https://github.com/CSYSI/Controllable‑LPMoE), demonstrating excellent segmentation performance compared to 31 state‑of‑the‑art (SOTA) methods and adaptability to multiple binary object segmentation tasks.

Abstract:
High‑definition 3D city maps enable city planning and change detection, which is essential for municipal compliance, map maintenance, and asset monitoring, including both built structures and urban greenery. Conventional Digital Surface Model (DSM) and image differencing are sensitive to vertical bias and viewpoint mismatch, while original point cloud or voxel models require large memory, assume perfect alignment, and degrade thin structures. We propose an uncertainty‑aware, object‑centric method for city‑scale LiDAR‑based change detection. Our method aligns data from different time periods using multi‑resolution Normal Distributions Transform (NDT) and a point‑to‑plane Iterative Closest Point (ICP) method, normalizes elevation, and computes a per‑point level of detection from registration covariance and surface roughness to calibrate change decisions. Geometry‑based associations are refined by semantic and instance segmentation and optimized using class‑constrained bipartite assignment with augmented dummies to handle split‑merge cases. Tiled processing bounds memory and preserves narrow ground changes, while instance‑level decisions integrate overlap, displacement, and volumetric differences under local detection gating. We perform experiments on a Subiaco (Western Australia) dataset captured in 2023 and again in 2025. Our method achieves 95.3% accuracy, 90.8% mF1, and 82.9% mIoU, improving over the strongest baseline, Triplet KPConv, by 0.3, 0.6, and 1.1 points, respectively. The datasets are available on IEEE DataPort (2023: https://ieee‑dataport.org/documents/2023‑subiaco‑wa‑3d‑hd‑lidar‑point‑cloud‑maps‑dataset and 2025: https://ieee‑dataport.org/documents/2025‑subiaco‑wa‑3d‑hd‑lidar‑gnss‑point‑cloud‑maps‑dataset). The source code is available at https://github.com/HaitianWang/IEEE‑Sensor‑Journal‑Changing‑Detection.
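
The per-point level of detection can be pictured as a distance threshold that grows with registration uncertainty and local roughness. The sketch below combines the three terms in quadrature and scales them by a confidence factor, a form common in point-cloud differencing. The abstract does not state the authors' exact formula, so the expression, the 1.96 factor, and all names are assumptions.

```python
import math


def level_of_detection(sigma_registration, roughness_a, roughness_b, z=1.96):
    """Smallest distance (same unit as the inputs) treated as real change.

    sigma_registration: standard deviation of the alignment error
    roughness_a / roughness_b: local surface roughness in epoch A and epoch B
    z: confidence factor (1.96 is roughly 95% under a normal assumption)
    """
    return z * math.sqrt(sigma_registration ** 2 + roughness_a ** 2 + roughness_b ** 2)


def is_significant_change(distance, sigma_registration, roughness_a, roughness_b, z=1.96):
    """Gate a measured point-to-surface distance by the local level of detection."""
    lod = level_of_detection(sigma_registration, roughness_a, roughness_b, z)
    return abs(distance) > lod


# Hypothetical values in metres: 3 cm registration error, 4 cm roughness in one epoch.
print(level_of_detection(0.03, 0.04, 0.0))
print(is_significant_change(0.20, 0.03, 0.04, 0.0))   # large displacement
print(is_significant_change(0.05, 0.03, 0.04, 0.0))   # within the noise band
```

Gating in this way suppresses apparent changes on rough surfaces such as vegetation, where small distances are expected even without real change.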

Abstract:
Automated histopathological image analysis plays a vital role in computer‑aided diagnosis of various diseases. Among developed algorithms, deep learning‑based approaches have demonstrated excellent performance in multiple tasks, including semantic tissue segmentation in histological images. In this study, we propose a novel approach based on attention‑driven feature fusion of convolutional neural networks (CNNs) and vision transformers (ViTs) within a unified dual‑encoder model to improve semantic segmentation performance. Evaluation on two publicly available datasets showed that our model achieved μIoU/μDice scores of 76.79%/86.87% on the GCPS dataset and 64.93%/76.60% on the PUMA dataset, outperforming state‑of‑the‑art and baseline benchmarks. The implementation of our method is publicly available in a GitHub repository: https://github.com/NimaTorbati/ACS‑SegNet

Abstract:
The extraction of pharmacological knowledge from regulatory documents has become a key focus in biomedical natural language processing, with applications ranging from adverse event monitoring to AI‑assisted clinical decision support. However, research in this field has predominantly relied on English‑language corpora such as DrugBank, leaving a significant gap in resources tailored to other healthcare systems. To address this limitation, we introduce DART (Drug Annotation from Regulatory Texts), the first structured corpus of Italian Summaries of Product Characteristics derived from the official repository of the Italian Medicines Agency (AIFA). The dataset was built through a reproducible pipeline encompassing web‑scale document retrieval, semantic segmentation of regulatory sections, and clinical summarization using a few‑shot‑tuned large language model with low‑temperature decoding. DART provides structured information on key pharmacological domains such as indications, adverse drug reactions, and drug‑drug interactions. To validate its utility, we implemented an LLM‑based drug interaction checker that leverages the dataset to infer clinically meaningful interactions. Experimental results show that instruction‑tuned LLMs can accurately infer potential interactions and their clinical implications when grounded in the structured textual fields of DART. We publicly release our code on GitHub: https://github.com/PRAISELab‑PicusLab/DART.

Abstract:
Vision Transformers (ViTs) partition input images into uniformly sized patches regardless of their content, resulting in long input sequence lengths for high‑resolution images. We present Adaptive Patch Transformers (APT), which addresses this by using multiple different patch sizes within the same image. APT reduces the total number of input tokens by allocating larger patch sizes in more homogeneous areas and smaller patches in more complex ones. APT achieves a drastic speedup in ViT inference and training, increasing throughput by 40% on ViT‑L and 50% on ViT‑H while maintaining downstream performance, and can be applied to a previously fine‑tuned ViT, converging in as little as 1 epoch. It also significantly reduces training and inference time without loss of performance in high‑resolution dense visual tasks, achieving up to 30% faster training and inference in visual QA, object detection, and semantic segmentation.
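
The idea of larger patches in homogeneous areas and smaller patches in complex ones can be sketched as a quadtree split. The abstract does not say which homogeneity criterion APT uses, so pixel variance, the threshold, and the function names below are assumptions rather than the authors' method.

```python
def region_variance(img, row, col, size):
    """Variance of pixel values in the size x size block starting at (row, col)."""
    vals = [img[i][j] for i in range(row, row + size) for j in range(col, col + size)]
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals) / len(vals)


def adaptive_patches(img, row, col, size, min_size, threshold):
    """Recursively split a block until it is homogeneous or reaches min_size.

    Returns a list of (row, col, size) patches that tile the block.
    """
    if size <= min_size or region_variance(img, row, col, size) <= threshold:
        return [(row, col, size)]
    half = size // 2
    patches = []
    for dr in (0, half):
        for dc in (0, half):
            patches += adaptive_patches(img, row + dr, col + dc, half, min_size, threshold)
    return patches


# Toy 8x8 image: checkerboard texture in the top-left quadrant, flat elsewhere.
image = [[(i + j) % 2 if i < 4 and j < 4 else 0 for j in range(8)] for i in range(8)]
patches = adaptive_patches(image, 0, 0, 8, min_size=2, threshold=0.0)
uniform_tokens = (8 // 2) ** 2
print(len(patches), "adaptive tokens vs", uniform_tokens, "uniform tokens")
```

In this toy case the textured quadrant is split down to the minimum patch size while each flat quadrant stays a single patch, so the token count falls from 16 to 7.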

Abstract:
In this work, we observe a counterintuitive phenomenon in self‑supervised learning (SSL): longer training may impair the performance of dense prediction tasks (e.g., semantic segmentation). We refer to this phenomenon as Self‑supervised Dense Degradation (SDD) and demonstrate its consistent presence across sixteen state‑of‑the‑art SSL methods with various losses, architectures, and datasets. When the model performs suboptimally on dense tasks at the end of training, measuring the performance during training becomes essential. However, evaluating dense performance effectively without annotations remains an open challenge. To tackle this issue, we introduce a Dense representation Structure Estimator (DSE), composed of a class‑relevance measure and an effective dimensionality measure. The proposed DSE is both theoretically grounded and empirically validated to be closely correlated with the downstream performance. Based on this metric, we introduce a straightforward yet effective model selection strategy and a DSE‑based regularization method. Experiments on sixteen SSL methods across four benchmarks confirm that model selection improves mIoU by 3.0% on average with negligible computational cost. Additionally, DSE regularization consistently mitigates the effects of dense degradation. Code is available at https://github.com/EldercatSAM/SSL‑Degradation.

Abstract:
Air‑ground collaborative intelligence is becoming a key approach for next‑generation urban intelligent transportation management, where aerial and ground systems work together on perception, communication, and decision‑making. However, the lack of a unified multi‑modal simulation environment has limited progress in studying cross‑domain perception, coordination under communication constraints, and joint decision optimization. To address this gap, we present TranSimHub, a unified simulation platform for air‑ground collaborative intelligence. TranSimHub offers synchronized multi‑view rendering across RGB, depth, and semantic segmentation modalities, ensuring consistent perception between aerial and ground viewpoints. It also supports information exchange between the two domains and includes a causal scene editor that enables controllable scenario creation and counterfactual analysis under diverse conditions such as different weather, emergency events, and dynamic obstacles. We release TranSimHub as an open‑source platform that supports end‑to‑end research on perception, fusion, and control across realistic air and ground traffic scenes. Our code is available at https://github.com/Traffic‑Alpha/TransSimHub.

Abstract:
Vision graph neural networks (ViG) have demonstrated promise in vision tasks as a competitive alternative to conventional convolutional neural networks (CNNs) and vision transformers (ViTs); however, common graph construction methods, such as k‑nearest neighbor (KNN), can be expensive on larger images. While methods such as Sparse Vision Graph Attention (SVGA) have shown promise, SVGA's fixed step scale can lead to over‑squashing and missing multiple connections to gain the same information that could be gained from a long‑range link. Based on this observation, we propose a new graph construction method, Logarithmic Scalable Graph Construction (LSGC), to enhance performance by limiting the number of long‑range links. To this end, we propose LogViG, a novel hybrid CNN‑GNN model that utilizes LSGC. Furthermore, inspired by the successes of multi‑scale and high‑resolution architectures, we introduce and apply a high‑resolution branch and fuse features between our high‑resolution and low‑resolution branches for a multi‑scale high‑resolution Vision GNN network. Extensive experiments show that LogViG beats existing ViG, CNN, and ViT architectures in terms of accuracy, GMACs, and parameters on image classification and semantic segmentation tasks. Our smallest model, Ti‑LogViG, achieves an average top‑1 accuracy on ImageNet‑1K of 79.9% with a standard deviation of 0.2%, 1.7% higher average accuracy than Vision GNN with a 24.3% reduction in parameters and 35.3% reduction in GMACs. Our work shows that leveraging long‑range links in graph construction for ViGs through our proposed LSGC can exceed the performance of current state‑of‑the‑art ViGs. Code is available at https://github.com/mmunir127/LogViG‑Official.
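
One way to picture the contrast between a fixed step scale and logarithmic scaling is to compare the link offsets each produces along a row or column of the patch grid. The sketch below is one plausible reading (power-of-two offsets), not the LSGC definition from the paper; the function names and the axis-aligned neighborhood are assumptions.

```python
def fixed_step_offsets(length, step):
    """Offsets at every multiple of `step` (fixed-step style)."""
    return list(range(step, length, step))


def log_offsets(length):
    """Offsets at powers of two: dense nearby, few long-range links."""
    offsets, d = [], 1
    while d < length:
        offsets.append(d)
        d *= 2
    return offsets


def grid_neighbors(row, col, height, width, offsets):
    """Axis-aligned neighbors of patch (row, col) at the given offsets."""
    neighbors = []
    for d in offsets:
        for c in (col - d, col + d):          # same row, shifted column
            if 0 <= c < width:
                neighbors.append((row, c))
        for r in (row - d, row + d):          # same column, shifted row
            if 0 <= r < height:
                neighbors.append((r, col))
    return neighbors


side = 56                                      # hypothetical 56x56 patch grid
print(len(fixed_step_offsets(side, 2)), "fixed-step offsets per direction")
print(log_offsets(side), "logarithmic offsets per direction")
```

With a fixed step of 2 on this grid there are 27 offsets per direction, while the logarithmic variant needs only 6, so the number of long-range links grows with the logarithm of the image size rather than linearly.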

Abstract:
Video comprises the vast majority of bits that are generated daily, and is the primary signal driving current innovations in robotics, remote sensing, and wearable technology. Yet, the most powerful video understanding models are too expensive for the resource‑constrained platforms used in these applications. One approach is to offload inference to the cloud; this gives access to GPUs capable of processing high‑resolution videos in real time. But even with reliable, high‑bandwidth communication channels, the combined latency of video encoding, model inference, and round‑trip communication prohibits use for certain real‑time applications. The alternative is to use fully local inference; but this places extreme constraints on computational and power costs, requiring smaller models and lower resolution, leading to degraded accuracy. To address these challenges, we propose Dedelayed, a real‑time inference system that divides computation between a remote model operating on delayed video frames and a local model with access to the current frame. The remote model is trained to make predictions on anticipated future frames, which the local model incorporates into its prediction for the current frame. The local and remote models are jointly optimized with an autoencoder that limits the transmission bitrate required by the available downlink communication channel. We evaluate Dedelayed on the task of real‑time streaming video segmentation using the BDD100k driving dataset. For a round trip delay of 100 ms, Dedelayed improves performance by 6.4 mIoU compared to fully local inference and 9.8 mIoU compared to remote inference ‑‑ an equivalent improvement to using a model ten times larger. We release our training code, pretrained models, and Python library at https://github.com/InterDigitalInc/dedelayed .

Abstract:
Establishing object‑level correspondence between egocentric and exocentric views is essential for intelligent assistants to deliver precise and intuitive visual guidance. However, this task faces numerous challenges, including extreme viewpoint variations, occlusions, and the presence of small objects. Existing approaches usually borrow solutions from video object segmentation models, but still suffer from the aforementioned challenges. Recently, the Segment Anything Model 2 (SAM 2) has shown strong generalization capabilities and excellent performance in video object segmentation. Yet, when simply applied to the ego‑exo correspondence (EEC) task, SAM 2 encounters severe difficulties due to ineffective ego‑exo feature fusion and limited long‑term memory capacity, especially for long videos. Addressing these problems, we propose a novel EEC framework based on SAM 2 with long‑term memories by presenting a dual‑memory architecture and an adaptive feature routing module inspired by Mixture‑of‑Experts (MoE). Compared to SAM 2, our approach features (i) a Memory‑View MoE module which consists of a dual‑branch routing mechanism to adaptively assign contribution weights to each expert feature along both channel and spatial dimensions, and (ii) a dual‑memory bank system with a simple yet effective compression strategy to retain critical long‑term information while eliminating redundancy. In the extensive experiments on the challenging EgoExo4D benchmark, our method, dubbed LM‑EEC, achieves new state‑of‑the‑art results and significantly outperforms existing methods and the SAM 2 baseline, showcasing its strong generalization across diverse scenarios. Our code and model are available at https://github.com/juneyeeHu/LM‑EEC.

Abstract:
Interest in robotics for forest management is growing, but perception in complex, natural environments remains a significant hurdle. Conditions such as heavy occlusion, variable lighting, and dense vegetation pose challenges to automated systems, which are essential for precision forestry, biodiversity monitoring, and the automation of forestry equipment. These tasks rely on advanced perceptual capabilities, such as detection and fine‑grained species classification of individual trees. Yet, existing datasets are inadequate to develop such perception systems, as they often focus on urban settings or a limited number of species. To address this, we present SilvaScenes, a new dataset for instance segmentation of tree species from under‑canopy images. Collected across five bioclimatic domains in Quebec, Canada, SilvaScenes features 1476 trees from 24 species with annotations from forestry experts. We demonstrate the relevance and challenging nature of our dataset by benchmarking modern deep learning approaches for instance segmentation. Our results show that, while tree segmentation is easy, with a top mean average precision (mAP) of 67.65%, species classification remains a significant challenge with an mAP of only 35.69%. Our dataset and source code will be available at https://github.com/norlab‑ulaval/SilvaScenes.

Abstract:
Referring Video Object Segmentation (RefVOS) seeks to segment target objects in videos guided by natural language descriptions, demanding both temporal reasoning and fine‑grained visual comprehension. Existing sampling strategies for LLM‑based approaches typically rely on either handcrafted heuristics or external keyframe models. The former often overlooks essential temporal cues, while the latter increases system complexity. To address this, we propose a unified framework that jointly optimizes Temporal Sentence Grounding (TSG) and RefVOS, naturally incorporating key moment grounding capability. During training, we introduce a novel TSG paradigm that employs a dedicated [FIND] token for key moment identification through temporal token similarity matching, thereby avoiding the need for external timestamp encodings. For inference, we design a Moment‑Centric Sampling (MCS) strategy that densely samples informative moments while sparsely sampling non‑essential frames, preserving both motion details and global context. To further enhance tracking stability, we develop Bidirectional Anchor‑updated Propagation (BAP), which leverages the most relevant moment as the start point for high‑quality mask initialization and dynamically updates at sampled points to mitigate accumulated errors. Code and model will be available at: https://github.com/Dmmm1997/MomentSeg
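
The dense-near-the-moment, sparse-elsewhere idea behind Moment-Centric Sampling can be sketched in a few lines. The window radius, the stride, and the function name are illustrative assumptions; the abstract does not give the actual sampling rates or how the key moment index is obtained.

```python
def moment_centric_sample(num_frames, key_frame, dense_radius, sparse_stride):
    """Pick frame indices: every frame near the key moment, every stride-th frame elsewhere."""
    start = max(0, key_frame - dense_radius)
    stop = min(num_frames, key_frame + dense_radius + 1)
    dense = set(range(start, stop))                    # keeps motion detail around the moment
    sparse = set(range(0, num_frames, sparse_stride))  # keeps global context cheaply
    return sorted(dense | sparse)


# Hypothetical 20-frame clip whose grounded key moment is frame 10.
frames = moment_centric_sample(num_frames=20, key_frame=10, dense_radius=2, sparse_stride=5)
print(frames)
```

Here 8 of the 20 frames are kept, 5 of them concentrated around the key moment.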

Abstract:
Visual grouping ‑‑ operationalized through tasks such as instance segmentation, visual grounding, and object detection ‑‑ enables applications ranging from robotic perception to photo editing. These fundamental problems in computer vision are powered by large‑scale, painstakingly annotated datasets. Despite their impact, these datasets are costly to build, biased in coverage, and difficult to scale. Synthetic datasets offer a promising alternative but struggle with flexibility, accuracy, and compositional diversity. We introduce Synthetic Object Compositions (SOC), an accurate and scalable data synthesis pipeline via a novel object‑centric composition strategy. It composes high‑quality synthetic object segments into new images using 3D geometric layout augmentation and camera configuration augmentation with generative harmonization and mask‑area‑weighted blending, yielding accurate and diverse masks, boxes, and referring expressions. Models trained on just 100K of our synthetic images outperform those trained on larger real datasets (GRIT 20M, V3Det 200K) and synthetic pipelines (Copy‑Paste, X‑Paste, SynGround, SegGen) by +24‑36% ‑‑ achieving +10.9 AP on LVIS and +8.4 NAcc on gRefCOCO. Beyond the general open‑vocabulary setup, SOC also enables controllable dataset construction for different use cases and boosts performance in both low‑data and closed‑vocabulary scenarios. Augmenting LVIS and COCO with synthetic object segments delivers strong performance across different real‑data scales and yields even greater improvements under extremely limited real‑data conditions, including +6.59 AP on a 1% COCO data setup. Furthermore, this controllability enables targeted data generation for intra‑class referring, a diagnostic grounding task we propose that requires fine‑grained attribute discrimination.

Abstract:
LiDAR semantic segmentation is crucial for autonomous vehicles and mobile robots, requiring high accuracy and real‑time processing, especially on resource‑constrained embedded systems. Previous state‑of‑the‑art methods often face a trade‑off between accuracy and speed. Point‑based and sparse convolution‑based methods are accurate but slow due to the complexity of neighbor searching and 3D convolutions. Projection‑based methods are faster but lose critical geometric information during the 2D projection. Additionally, many recent methods rely on test‑time augmentation (TTA) to improve performance, which further slows the inference. Moreover, the pre‑processing phase across all methods increases execution time and is demanding on embedded platforms. Therefore, we introduce HARP‑NeXt, a high‑speed and accurate LiDAR semantic segmentation network. We first propose a novel pre‑processing methodology that significantly reduces computational overhead. Then, we design the Conv‑SE‑NeXt feature extraction block to efficiently capture representations without deep layer stacking per network stage. We also employ a multi‑scale range‑point fusion backbone that leverages information at multiple abstraction levels to preserve essential geometric details, thereby enhancing accuracy. Experiments on the nuScenes and SemanticKITTI benchmarks show that HARP‑NeXt achieves a superior speed‑accuracy trade‑off compared to all state‑of‑the‑art methods, and, without relying on ensemble models or TTA, is comparable to the top‑ranked PTv3, while running 24× faster. The code is available at https://github.com/SamirAbouHaidar/HARP‑NeXt

Abstract:
Accurate semantic segmentation of terrestrial laser scanning (TLS) point clouds is limited by costly manual annotation. We propose a semi‑automated, uncertainty‑aware pipeline that integrates spherical projection, feature enrichment, ensemble learning, and targeted annotation to reduce labeling effort, while sustaining high accuracy. Our approach projects 3D points to a 2D spherical grid, enriches pixels with multi‑source features, and trains an ensemble of segmentation networks to produce pseudo‑labels and uncertainty maps, the latter guiding annotation of ambiguous regions. The 2D outputs are back‑projected to 3D, yielding densely annotated point clouds supported by a three‑tier visualization suite (2D feature maps, 3D colorized point clouds, and compact virtual spheres) for rapid triage and reviewer guidance. Using this pipeline, we build Mangrove3D, a semantic segmentation TLS dataset for mangrove forests. We further evaluate data efficiency and feature importance to address two key questions: (1) how much annotated data are needed and (2) which features matter most. Results show that performance saturates after ~12 annotated scans, geometric features contribute the most, and compact nine‑channel stacks capture nearly all discriminative power, with the mean Intersection over Union (mIoU) plateauing at around 0.76. Finally, we confirm the generalization of our feature‑enrichment strategy through cross‑dataset tests on ForestSemantic and Semantic3D. Our contributions include: (i) a robust, uncertainty‑aware TLS annotation pipeline with visualization tools; (ii) the Mangrove3D dataset; and (iii) empirical guidance on data efficiency and feature importance, thus enabling scalable, high‑quality segmentation of TLS point clouds for ecological monitoring and beyond. The dataset and processing scripts are publicly available at https://fz‑rit.github.io/through‑the‑lidars‑eye/.
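
The first step of the pipeline, projecting 3D points onto a 2D spherical grid, can be sketched as a conversion to azimuth and elevation followed by binning into pixels. The image size, the field of view, and the function name are assumptions; the Mangrove3D processing scripts may use a different convention.

```python
import math


def spherical_pixel(x, y, z, width, height, fov_up_deg=90.0, fov_down_deg=-90.0):
    """Map a 3D point (scanner at the origin) to (row, col, range) on a spherical image."""
    rng = math.sqrt(x * x + y * y + z * z)
    if rng == 0.0:
        raise ValueError("point coincides with the scanner origin")
    azimuth = math.atan2(y, x)                  # horizontal angle in [-pi, pi]
    elevation = math.asin(z / rng)              # vertical angle in [-pi/2, pi/2]
    fov_up = math.radians(fov_up_deg)
    fov_down = math.radians(fov_down_deg)
    u = 0.5 * (1.0 - azimuth / math.pi) * width                 # column coordinate
    v = (fov_up - elevation) / (fov_up - fov_down) * height     # row coordinate, top = fov_up
    col = min(width - 1, max(0, int(u)))
    row = min(height - 1, max(0, int(v)))
    return row, col, rng


# Hypothetical 360x180 grid covering the full sphere (one degree per pixel).
print(spherical_pixel(1.0, 0.0, 0.0, 360, 180))   # straight ahead
print(spherical_pixel(0.0, 0.0, 2.0, 360, 180))   # directly above the scanner
```

Storing the range alongside each pixel is what allows 2D predictions to be back-projected to the original 3D points later.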

Abstract:
The generalization of the Transformer architecture via MetaFormer has reshaped our understanding of its success in computer vision. By replacing self‑attention with simpler token mixers, MetaFormer provides strong baselines for vision tasks. However, while extensively studied on natural image datasets, its use in medical imaging remains scarce, and existing works rarely compare different token mixers, potentially overlooking more suitable design choices. In this work, we present the first comprehensive study of token mixers for medical imaging. We systematically analyze pooling‑, convolution‑, and attention‑based token mixers within the MetaFormer architecture on image classification (global prediction task) and semantic segmentation (dense prediction task). Our evaluation spans nine datasets (seven 2D and two 3D) covering diverse modalities and common challenges in the medical domain. Given the prevalence of pretraining from natural images to mitigate medical data scarcity, we also examine transferring pretrained weights to new token mixers. Our results show that, for classification, low‑complexity token mixers (e.g. grouped convolution or pooling) are sufficient, aligning with findings on natural images. Pretrained weights remain useful in some settings despite the domain gap introduced by the new token mixer. For segmentation, we find that the local inductive bias of convolutional token mixers is essential. Grouped convolutions emerge as the preferred choice, as they reduce runtime and parameter count compared to standard convolutions, while the MetaFormer's channel‑MLPs already provide the necessary cross‑channel interactions.
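
The parameter saving of grouped convolutions follows directly from the weight shape: each output channel only sees `in_channels / groups` input channels. The sketch below counts parameters for a 2D convolution; the channel and kernel sizes are arbitrary illustrative values, not settings from the study.

```python
def conv2d_params(in_channels, out_channels, kernel_size, groups=1, bias=True):
    """Number of learnable parameters in a 2D convolution with square kernels."""
    if in_channels % groups or out_channels % groups:
        raise ValueError("channels must be divisible by groups")
    weights = out_channels * (in_channels // groups) * kernel_size * kernel_size
    return weights + (out_channels if bias else 0)


standard = conv2d_params(64, 64, 3, groups=1, bias=False)
grouped = conv2d_params(64, 64, 3, groups=8, bias=False)
depthwise = conv2d_params(64, 64, 3, groups=64, bias=False)
print(standard, grouped, depthwise)
```

With 8 groups the weight count drops by a factor of 8. The cross-channel mixing that grouping removes is the part the abstract attributes to the MetaFormer's channel-MLPs.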

Abstract:
Recent attempts to transfer features from 2D Vision‑Language Models (VLMs) to 3D semantic segmentation expose a persistent trade‑off. Directly projecting 2D features into 3D yields noisy and fragmented predictions, whereas enforcing geometric coherence necessitates costly training pipelines and large‑scale annotated 3D data. We argue that this limitation stems from the dominant segmentation‑and‑matching paradigm, which fails to reconcile 2D semantics with 3D geometric structure. The geometric cues are not eliminated during the 2D‑to‑3D transfer but remain latent within the noisy and view‑aggregated features. To exploit this property, we propose GeoPurify that applies a small Student Affinity Network to purify 2D VLM‑generated 3D point features using geometric priors distilled from a 3D self‑supervised teacher model. During inference, we devise a Geometry‑Guided Pooling module to further denoise the point cloud and ensure the semantic and structural consistency. Benefiting from latent geometric information and the learned affinity network, GeoPurify effectively mitigates the trade‑off and achieves superior data efficiency. Extensive experiments on major 3D benchmarks demonstrate that GeoPurify achieves or surpasses state‑of‑the‑art performance while utilizing only about 1.5% of the training data.

Abstract:
Automated semantic segmentation of whole‑slide images (WSIs) stained with hematoxylin and eosin (H&E) is essential for large‑scale artificial intelligence‑based biomarker analysis in breast cancer. However, existing public datasets for breast cancer segmentation lack the morphological diversity needed to support model generalizability and robust biomarker validation across heterogeneous patient cohorts. We introduce BrEast cancEr hisTopathoLogy sEgmentation (BEETLE), a dataset for multiclass semantic segmentation of H&E‑stained breast cancer WSIs. It consists of 587 biopsies and resections from three collaborating clinical centers and two public datasets, digitized using seven scanners, and covers all molecular subtypes and histological grades. Using diverse annotation strategies, we collected annotations across four classes ‑ invasive epithelium, non‑invasive epithelium, necrosis, and other ‑ with particular focus on morphologies underrepresented in existing datasets, such as ductal carcinoma in situ and dispersed lobular tumor cells. The dataset's diversity and relevance to the rapidly growing field of automated biomarker quantification in breast cancer ensure its high potential for reuse. Finally, we provide a well‑curated, multicentric external evaluation set to enable standardized benchmarking of breast cancer segmentation models.

Abstract:
Pixel‑level recognition of objects manipulated by the user from egocentric images enables key applications spanning assistive technologies, industrial safety, and activity monitoring. However, progress in this area is currently hindered by the scarcity of annotated datasets, as existing approaches rely on costly manual labels. In this paper, we propose to learn human‑object interaction detection leveraging narrations, i.e., natural language descriptions of the actions performed by the camera wearer, which contain clues about manipulated objects. We introduce Narration‑Supervised in‑Hand Object Segmentation (NS‑iHOS), a novel task where models have to learn to segment in‑hand objects by learning from natural‑language narrations in a weakly‑supervised regime. Narrations are then not employed at inference time. We showcase the potential of the task by proposing Weakly‑Supervised In‑hand Object Segmentation from Human Narrations (WISH), an end‑to‑end model distilling knowledge from narrations to learn plausible hand‑object associations and enable in‑hand object segmentation without using narrations at test time. We benchmark WISH against different baselines based on open‑vocabulary object detectors and vision‑language models. Experiments on EPIC‑Kitchens and Ego4D show that WISH surpasses all baselines, recovering more than 50% of the performance of fully supervised methods, without employing fine‑grained pixel‑wise annotations. Code and data can be found at https://fpv‑iplab.github.io/WISH.

Abstract:
Referring Video Object Segmentation (RVOS) aims to segment target objects in video sequences based on natural language descriptions. While recent advances in Multi‑modal Large Language Models (MLLMs) have improved RVOS performance through enhanced text‑video understanding, several challenges remain, including insufficient exploitation of MLLMs' prior knowledge, prohibitive computational and memory costs for long‑duration videos, and inadequate handling of complex temporal dynamics. In this work, we propose SVAC, a unified model that improves RVOS by scaling up input frames and segmentation tokens to enhance video‑language interaction and segmentation precision. To address the resulting computational challenges, SVAC incorporates the Anchor‑Based Spatio‑Temporal Compression (ASTC) module to compress visual tokens while preserving essential spatio‑temporal structure. Moreover, the Clip‑Specific Allocation (CSA) strategy is introduced to better handle dynamic object behaviors across video clips. Experimental results demonstrate that SVAC achieves state‑of‑the‑art performance on multiple RVOS benchmarks with competitive efficiency. Our code is available at https://github.com/lizhang1998/SVAC.

Abstract:
We present a novel technique for interpreting the neurons in CLIP‑ResNet by decomposing their contributions to the output into individual computation paths. More specifically, we analyze all pairwise combinations of neurons and the following attention heads of CLIP's attention‑pooling layer. We find that these neuron‑head pairs can be approximated by a single direction in CLIP‑ResNet's image‑text embedding space. Leveraging this insight, we interpret each neuron‑head pair by associating it with text. Additionally, we find that only a sparse set of the neuron‑head pairs have a significant contribution to the output value, and that some neuron‑head pairs, while polysemantic, represent sub‑concepts of their corresponding neurons. We use these observations for two applications. First, we employ the pairs for training‑free semantic segmentation, outperforming previous methods for CLIP‑ResNet. Second, we utilize the contributions of neuron‑head pairs to monitor dataset distribution shifts. Our results demonstrate that examining individual computation paths in neural networks uncovers interpretable units, and that such units can be utilized for downstream tasks.

Abstract:
Sa2VA is a recent model for language‑guided dense grounding in images and video that achieves state‑of‑the‑art results on multiple segmentation benchmarks and that has become widely popular. However, we found that Sa2VA does not perform according to its full potential for referring video object segmentation tasks. We identify inconsistencies between training and inference procedures as the key factor holding it back. To mitigate this issue, we propose an improved version of Sa2VA, Sa2VA‑i, that rectifies these issues and improves the results. In fact, Sa2VA‑i sets a new state of the art for multiple video benchmarks and achieves improvements of up to +11.6 J&F on MeViS, +1.4 on Ref‑YT‑VOS, +3.3 on Ref‑DAVIS and +4.1 on ReVOS using the same Sa2VA checkpoints. With our fixes, the Sa2VA‑i‑1B model even performs on par with the original Sa2VA‑26B model on the MeViS benchmark. We hope that this work will show the importance of seemingly trivial implementation details and that it will provide valuable insights for the referring video segmentation field. We provide the code and updated models at https://github.com/kumuji/sa2va‑i
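
For reference, the J&F score reported above is conventionally the mean of region similarity J (mask intersection over union) and boundary accuracy F (an F-measure over boundary precision and recall). The following is a minimal sketch under stated assumptions: masks are flattened binary sequences, boundary precision and recall are assumed to be computed elsewhere, and all function names are illustrative rather than taken from the Sa2VA‑i code.

```python
def region_similarity(pred_mask, gt_mask):
    """J: intersection over union of two flattened binary masks."""
    inter = sum(1 for p, g in zip(pred_mask, gt_mask) if p and g)
    union = sum(1 for p, g in zip(pred_mask, gt_mask) if p or g)
    # Two empty masks are treated as a perfect match (an assumed convention).
    return 1.0 if union == 0 else inter / union


def boundary_f_measure(precision, recall):
    """F: harmonic mean of boundary precision and recall (given as inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def j_and_f(j, f):
    """J&F: arithmetic mean of the two scores."""
    return (j + f) / 2


j = region_similarity([1, 1, 0, 0], [1, 0, 1, 0])  # 1 shared pixel, 3 in the union
f = boundary_f_measure(0.5, 1.0)
print(j_and_f(j, f))
```

Benchmarks typically average these per-frame scores over all frames and objects before reporting them as percentages.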

Abstract:
Regularization is essential in deep learning to enhance generalization and mitigate overfitting. However, conventional techniques often rely on heuristics, making them less reliable or effective across diverse settings. We propose Self Identity Mapping (SIM), a simple yet effective, data‑intrinsic regularization framework that leverages an inverse mapping mechanism to enhance representation learning. By reconstructing the input from its transformed output, SIM reduces information loss during forward propagation and facilitates smoother gradient flow. To address computational inefficiencies, we instantiate SIM as ρSIM by incorporating patch‑level feature sampling and a projection‑based method to reconstruct latent features, effectively lowering complexity. As a model‑agnostic, task‑agnostic regularizer, SIM can be seamlessly integrated as a plug‑and‑play module, making it applicable to different network architectures and tasks. We extensively evaluate ρSIM across three tasks: image classification, few‑shot prompt learning, and domain generalization. Experimental results show consistent improvements over baseline methods, highlighting ρSIM's ability to enhance representation learning across various tasks. We also demonstrate that ρSIM is orthogonal to existing regularization methods, boosting their effectiveness. Moreover, our results confirm that ρSIM effectively preserves semantic information and enhances performance in dense‑to‑dense tasks, such as semantic segmentation and image translation, as well as in non‑visual domains including audio classification and time series anomaly detection. The code is publicly available at https://github.com/XiudingCai/SIM‑pytorch.

Abstract:
Text‑to‑image diffusion models excel at translating language prompts into photorealistic images by implicitly grounding textual concepts through their cross‑modal attention mechanisms. Recent multi‑modal diffusion transformers extend this by introducing joint self‑attention over concatenated image and text tokens, enabling richer and more scalable cross‑modal alignment. However, a detailed understanding of how and where these attention maps contribute to image generation remains limited. In this paper, we introduce Seg4Diff (Segmentation for Diffusion), a systematic framework for analyzing the attention structures of MM‑DiT, with a focus on how specific layers propagate semantic information from text to image. Through comprehensive analysis, we identify a semantic grounding expert layer, a specific MM‑DiT block that consistently aligns text tokens with spatially coherent image regions, naturally producing high‑quality semantic segmentation masks. We further demonstrate that applying a lightweight fine‑tuning scheme with mask‑annotated image data enhances the semantic grouping capabilities of these layers and thereby improves both segmentation performance and generated image fidelity. Our findings demonstrate that semantic grouping is an emergent property of diffusion transformers and can be selectively amplified to advance both segmentation and generation performance, paving the way for unified models that bridge visual perception and generation.

Abstract:
Autonomous robotic systems applied to new domains require an abundance of expensive, pixel‑level dense labels to train robust semantic segmentation models under full supervision. This study proposes a model‑agnostic Depth Edge Alignment Loss to improve Weakly Supervised Semantic Segmentation models across different datasets. The methodology generates pixel‑level semantic labels from image‑level supervision, avoiding expensive annotation processes. While weak supervision is widely explored in traditional computer vision, our approach adds supervision with pixel‑level depth information, a modality commonly available in robotic systems. We demonstrate that our approach not only improves segmentation performance across datasets and models, but can also be combined with other losses for even better performance, with improvements of up to +5.439, +1.274 and +16.416 points in mean Intersection over Union on the PASCAL VOC / MS COCO validation, and the HOPE static onboarding split, respectively. Our code is made publicly available at https://github.com/DTU‑PAS/DEAL.

Abstract:
Referring video object segmentation (RVOS) requires segmenting and tracking objects in videos conditioned on natural‑language expressions, demanding fine‑grained understanding of both appearance and motion. Building on Sa2VA, which couples a Multi‑modal Large Language Model (MLLM) with the video segmentation model SAM2, we identify two key bottlenecks that limit segmentation performance: sparse frame sampling and reliance on a single [SEG] token for an entire video. We propose Segmentation Augmented and Selective Averaged Sa2VA (SaSaSa2VA) to address these issues. On the 7th LSVOS Challenge (RVOS track), SaSaSa2VA achieves a J&F of 67.45, ranking first and surpassing the runner‑up by 2.80 points. This result and ablation studies demonstrate that efficient segmentation augmentation and test‑time ensembling substantially enhance grounded MLLMs for RVOS. The code is released in the Sa2VA repository: https://github.com/bytedance/Sa2VA.

Abstract:
While significant advances exist in pseudo‑label generation for semi‑supervised semantic segmentation, pseudo‑label selection remains understudied. Existing methods typically use fixed confidence thresholds to retain high‑confidence predictions as pseudo‑labels. However, these methods cannot cope with network overconfidence tendency, where correct and incorrect predictions overlap significantly in high‑confidence regions, making separation challenging and amplifying model cognitive bias. Meanwhile, the direct discarding of low‑confidence predictions disrupts spatial‑semantic continuity, causing critical context loss. We propose Confidence Separable Learning (CSL) to address these limitations. CSL formulates pseudo‑label selection as a convex optimization problem within the confidence distribution feature space, establishing sample‑specific decision boundaries to distinguish reliable from unreliable predictions. Additionally, CSL introduces random masking of reliable pixels to guide the network in learning contextual relationships from low‑reliability regions, thereby mitigating the adverse effects of discarding uncertain predictions. Extensive experimental results on the Pascal, Cityscapes, and COCO benchmarks show that CSL performs favorably against state‑of‑the‑art methods. Code and model weights are available at https://github.com/PanLiuCSU/CSL.
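
To make the contrast concrete, the fixed-threshold selection that the abstract attributes to existing methods can be sketched as follows. This illustrates only the baseline, not CSL itself; the threshold value 0.95, the ignore index 255, and the function name are assumptions chosen for illustration.

```python
def select_pseudo_labels(pixel_probs, threshold=0.95, ignore_index=255):
    """Keep the argmax class where max confidence >= threshold, else ignore.

    pixel_probs: list of per-pixel class-probability lists.
    """
    labels = []
    for probs in pixel_probs:
        confidence = max(probs)
        if confidence >= threshold:
            labels.append(probs.index(confidence))
        else:
            # Low-confidence pixels are discarded from the training signal.
            labels.append(ignore_index)
    return labels


pixels = [
    [0.97, 0.02, 0.01],  # confident, kept as class 0
    [0.50, 0.45, 0.05],  # uncertain, discarded
    [0.01, 0.03, 0.96],  # confident, kept as class 2
]
print(select_pseudo_labels(pixels))  # [0, 255, 2]
```

The sketch shows both weaknesses named above: a confidently wrong pixel passes the threshold unchanged, and every discarded pixel leaves a hole in the spatial context.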

Abstract:
Embodied intelligence relies on accurately segmenting objects actively involved in interactions. Action‑based video object segmentation addresses this by linking segmentation with action semantics, but it depends on large‑scale annotations and prompts that are costly, inconsistent, and prone to multimodal noise such as imprecise masks and referential ambiguity. To date, this challenge remains unexplored. In this work, we take the first step by studying action‑based video object segmentation under label noise, focusing on two sources: textual prompt noise (category flips and within‑category noun substitutions) and mask annotation noise (perturbed object boundaries to mimic imprecise supervision). Our contributions are threefold. First, we introduce two types of label noise for the action‑based video object segmentation task. Second, we build ActiSeg‑NL, the first benchmark for action‑based video object segmentation under label noise, adapt six label‑noise learning strategies to this setting, and establish protocols for evaluating them under textual, boundary, and mixed noise. Third, we provide a comprehensive analysis linking noise types to failure modes and robustness gains, and we introduce a Parallel Mask Head Mechanism (PMHM) to address mask annotation noise. Qualitative evaluations further reveal characteristic failure modes, including boundary leakage and mislocalization under boundary perturbations, as well as occasional identity substitutions under textual flips. Our comparative analysis reveals that different learning strategies exhibit distinct robustness profiles, governed by a foreground‑background trade‑off where some achieve balanced performance while others prioritize foreground accuracy at the cost of background precision. The established benchmark and source code will be made publicly available at https://github.com/mylwx/ActiSeg‑NL.

Abstract:
Multi‑modal image segmentation faces real‑world deployment challenges from incomplete/corrupted modalities degrading performance. While existing methods address training‑inference modality gaps via specialized per‑combination models, they introduce high deployment costs by requiring exhaustive model subsets and model‑modality matching. In this work, we propose a unified modality‑relax segmentation network (UniMRSeg) through hierarchical self‑supervised compensation (HSSC). Our approach hierarchically bridges representation gaps between complete and incomplete modalities across input, feature and output levels. First, we adopt modality reconstruction with the hybrid shuffled‑masking augmentation, encouraging the model to learn the intrinsic modality characteristics and generate meaningful representations for missing modalities through cross‑modal fusion. Next, modality‑invariant contrastive learning implicitly compensates the feature space distance among incomplete‑complete modality pairs. Furthermore, the proposed lightweight reverse attention adapter explicitly compensates for the weak perceptual semantics in the frozen encoder. Last, UniMRSeg is fine‑tuned under the hybrid consistency constraint to ensure stable prediction under all modality combinations without large performance fluctuations. Without bells and whistles, UniMRSeg significantly outperforms the state‑of‑the‑art methods under diverse missing modality scenarios on MRI‑based brain tumor segmentation, RGB‑D semantic segmentation, and RGB‑D/T salient object segmentation. The code will be released at https://github.com/Xiaoqi‑Zhao‑DLUT/UniMRSeg.

Abstract:
Video object segmentation (VOS) is a challenging task with wide applications such as video editing and autonomous driving. While Cutie provides strong query‑based segmentation and SAM2 offers enriched representations via a pretrained ViT encoder, each has limitations in feature capacity and temporal modeling. In this report, we propose a framework that integrates their complementary strengths by replacing the encoder of Cutie with the ViT encoder of SAM2 and introducing a motion prediction module for temporal stability. We further adopt an ensemble strategy combining Cutie, SAM2, and our variant, achieving 3rd place in the MOSEv2 track of the 7th LSVOS Challenge. We refer to our final model as SCOPE (SAM2‑CUTIE Object Prediction Ensemble). This demonstrates the effectiveness of enriched feature representation and motion prediction for robust video object segmentation. The code is available at https://github.com/2025‑LSVOS‑3rd‑place/MOSEv2_3rd_place.

Abstract:
We introduce VocAlign, a novel source‑free domain adaptation framework specifically designed for VLMs in open‑vocabulary semantic segmentation. Our method adopts a student‑teacher paradigm enhanced with a vocabulary alignment strategy, which improves pseudo‑label generation by incorporating additional class concepts. To ensure efficiency, we use Low‑Rank Adaptation (LoRA) to fine‑tune the model, preserving its original capabilities while minimizing computational overhead. In addition, we propose a Top‑K class selection mechanism for the student model, which significantly reduces memory requirements while further improving adaptation performance. Our approach achieves a notable 6.11 mIoU improvement on the Cityscapes dataset and demonstrates superior performance on zero‑shot segmentation benchmarks, setting a new standard for source‑free adaptation in the open‑vocabulary setting.

Abstract:
The recent emergence of memory‑based video segmentation methods such as SAM2 has led to models with excellent performance in segmentation tasks, achieving leading results on numerous benchmarks. However, these models are not fully adjusted for visual object tracking, where distractors (i.e., objects visually similar to the target) pose a key challenge. In this paper we propose a distractor‑aware drop‑in memory module and introspection‑based management method for SAM2, leading to DAM4SAM. Our design effectively reduces the tracking drift toward distractors and improves redetection capability after object occlusion. To facilitate the analysis of tracking in the presence of distractors, we construct DiDi, a Distractor‑Distilled dataset. DAM4SAM outperforms SAM2.1 on thirteen benchmarks and sets new state‑of‑the‑art results on ten. Furthermore, integrating the proposed distractor‑aware memory into the real‑time tracker EfficientTAM leads to an 11% improvement and matches the tracking quality of the non‑real‑time SAM2.1‑L on multiple tracking and segmentation benchmarks, while integration with the edge‑based tracker EdgeTAM delivers a 4% performance boost, demonstrating very good generalization across architectures.

Abstract:
Generalized visual grounding tasks, including Generalized Referring Expression Comprehension (GREC) and Segmentation (GRES), extend the classical visual grounding paradigm by accommodating multi‑target and non‑target scenarios. Specifically, GREC focuses on accurately identifying all referential objects at the coarse bounding box level, while GRES aims to achieve fine‑grained pixel‑level perception. However, existing approaches typically treat these tasks independently, overlooking the benefits of jointly training GREC and GRES to ensure consistent multi‑granularity predictions and streamline the overall process. Moreover, current methods often treat GRES as a semantic segmentation task, neglecting the crucial role of instance‑aware capabilities and the necessity of ensuring consistent predictions between instance‑level boxes and masks. To address these limitations, we propose InstanceVG, a multi‑task generalized visual grounding framework equipped with instance‑aware capabilities, which leverages instance queries to unify the joint and consistency predictions of instance‑level boxes and masks. To the best of our knowledge, InstanceVG is the first framework to simultaneously tackle both GREC and GRES while incorporating instance‑aware capabilities into generalized visual grounding. To instantiate the framework, we assign each instance query a prior reference point, which also serves as an additional basis for target matching. This design facilitates consistent predictions of points, boxes, and masks for the same instance. Extensive experiments conducted on ten datasets across four tasks demonstrate that InstanceVG achieves state‑of‑the‑art performance, significantly surpassing the existing methods in various evaluation metrics. The code and model will be publicly available at https://github.com/Dmmm1997/InstanceVG.

Abstract:
Tram‑human interaction safety is an important challenge, given that trams frequently operate in densely populated areas, where collisions can cause anything from minor injuries to fatal outcomes. This paper addresses the issue from the perspective of designing a solution leveraging digital image processing, deep learning, and artificial intelligence to improve the safety of pedestrians, drivers, cyclists, pets, and tram passengers. We present RailSafeNet, a real‑time framework that fuses semantic segmentation, object detection and a rule‑based Distance Assessor to highlight track intrusions. Using only monocular video, the system identifies rails, localises nearby objects and classifies their risk by comparing projected distances with the standard 1435 mm rail gauge. Experiments on the diverse RailSem19 dataset show that a class‑filtered SegFormer B3 model achieves 65% intersection‑over‑union (IoU), while a fine‑tuned YOLOv8 attains 75.6% mean average precision (mAP) calculated at an IoU threshold of 0.50. RailSafeNet therefore delivers accurate, annotation‑light scene understanding that can warn drivers before dangerous situations escalate. Code available at https://github.com/oValach/RailSafeNet.
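
The idea of using the rail gauge as a metric reference can be sketched as follows. This is a simplified illustration, not the RailSafeNet Distance Assessor: it assumes the two rails and the object's ground contact point are measured on the same image row, so the known 1435 mm gauge gives a local millimetres-per-pixel scale. The risk thresholds and all function names are hypothetical.

```python
RAIL_GAUGE_MM = 1435.0  # standard gauge, as stated in the abstract


def lateral_distance_mm(object_x_px, left_rail_x_px, right_rail_x_px):
    """Estimate the object's lateral distance from the nearest rail in mm."""
    gauge_px = right_rail_x_px - left_rail_x_px
    if gauge_px <= 0:
        raise ValueError("right rail must lie to the right of the left rail")
    if left_rail_x_px <= object_x_px <= right_rail_x_px:
        return 0.0  # the object is between the rails
    offset_px = min(abs(object_x_px - left_rail_x_px),
                    abs(object_x_px - right_rail_x_px))
    # Scale pixels to millimetres using the gauge width on this image row.
    return offset_px * RAIL_GAUGE_MM / gauge_px


def risk_level(distance_mm, danger_mm=1000.0, warning_mm=2500.0):
    """Map a distance to a coarse risk class (thresholds are hypothetical)."""
    if distance_mm <= danger_mm:
        return "danger"
    if distance_mm <= warning_mm:
        return "warning"
    return "safe"


# Rails at x=400 and x=600 on this row, so 200 px correspond to 1435 mm.
print(risk_level(lateral_distance_mm(700, 400, 600)))   # danger
print(risk_level(lateral_distance_mm(1000, 400, 600)))  # safe
```

Because the pixel width of the gauge shrinks with distance from the camera, taking the scale from the object's own image row compensates for perspective to a first approximation.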

Abstract:
Few‑shot semantic segmentation has recently attracted great attention. The goal is to develop a model capable of segmenting unseen classes using only a few annotated samples. Most existing approaches adapt a pre‑trained model by training an additional module from scratch. Achieving optimal performance with these approaches requires extensive training on large‑scale datasets. The Segment Anything Model 2 (SAM2) is a foundational model for zero‑shot image and video segmentation with a modular design. In this paper, we propose a Few‑Shot segmentation method based on SAM2 (FS‑SAM2), where SAM2's video capabilities are directly repurposed for the few‑shot task. Moreover, we apply Low‑Rank Adaptation (LoRA) to the original modules in order to handle the diverse images typically found in standard datasets, unlike the temporally connected frames used in SAM2's pre‑training. With this approach, only a small number of parameters is meta‑trained, which effectively adapts SAM2 while benefiting from its impressive segmentation performance. Our method supports any K‑shot configuration. We evaluate FS‑SAM2 on the PASCAL‑5^i, COCO‑20^i and FSS‑1000 datasets, achieving remarkable results and demonstrating excellent computational efficiency during inference. Code is available at https://github.com/fornib/FS‑SAM2

Abstract:
Semantic segmentation networks (SSNs) are central to safety‑critical applications such as medical imaging and autonomous driving, where robustness under uncertainty is essential. However, existing probabilistic verification methods often fail to scale with the complexity and dimensionality of modern segmentation tasks, producing guarantees that are overly conservative and of limited practical value. We propose a probabilistic verification framework that is architecture‑agnostic and scalable to high‑dimensional input‑output spaces. Our approach employs conformal inference (CI), enhanced by a novel technique that we call the clipping block, to provide provable guarantees while mitigating the excessive conservatism of prior methods. Experiments on large‑scale segmentation models across CamVid, OCTA‑500, Lung Segmentation, and Cityscapes demonstrate that our framework delivers reliable safety guarantees while substantially reducing conservatism compared to state‑of‑the‑art approaches on segmentation tasks. We also provide a public GitHub repository (https://github.com/Navidhashemicodes/SSN_Reach_CLP_Surrogate) for this approach, to support reproducibility.
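
As background for the conformal inference (CI) step mentioned above, the standard split-conformal threshold can be sketched as follows. This shows only the generic calibration quantile, not the paper's clipping block or its verification pipeline; the nonconformity scores are placeholders and the function name is an assumption.

```python
import math


def conformal_quantile(calibration_scores, alpha):
    """Return the split-conformal threshold at miscoverage level alpha.

    With n exchangeable calibration scores, the ceil((n + 1) * (1 - alpha))-th
    smallest score bounds a fresh score with probability at least 1 - alpha.
    """
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration samples for this alpha: only a trivial bound.
        return math.inf
    return sorted(calibration_scores)[rank - 1]


scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
print(conformal_quantile(scores, alpha=0.2))   # 0.9
print(conformal_quantile(scores, alpha=0.05))  # inf
```

The second call shows where excess conservatism can come from: with few calibration samples relative to the requested confidence, the guarantee degenerates to an uninformative bound.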

Abstract:
Infrared‑visible image fusion methods aim at generating fused images with good visual quality and also facilitate the performance of high‑level tasks. Indeed, existing semantic‑driven methods have considered semantic information injection for downstream applications. However, none of them investigates the potential for reciprocal promotion between pixel‑wise image fusion and cross‑modal feature fusion perception tasks from a macroscopic task‑level perspective. To address this limitation, we propose a unified network for image fusion and semantic segmentation, termed MAFS. MAFS is a parallel structure, containing a fusion sub‑network and a segmentation sub‑network. On the one hand, we devise a heterogeneous feature fusion strategy to enhance semantic‑aware capabilities for image fusion. On the other hand, by cascading the fusion sub‑network and a segmentation backbone, segmentation‑related knowledge is transferred to promote feature‑level fusion‑based segmentation. Within the framework, we design a novel multi‑stage Transformer decoder to aggregate fine‑grained multi‑scale fused features efficiently. Additionally, a dynamic factor based on the max‑min fairness allocation principle is introduced to generate adaptive weights of two tasks and guarantee smooth training in a multi‑task manner. Extensive experiments demonstrate that our approach achieves competitive results compared with state‑of‑the‑art methods. The code is available at https://github.com/Abraham‑Einstein/MAFS/.

Abstract:
Autonomous‑driving perception systems require robust Multi‑Object Tracking (MOT) to operate reliably in dynamic environments. MOT maintains consistent object identities across frames while preserving spatial accuracy. Recent foundation models, such as SAM2, provide promptable video segmentation without task‑specific fine‑tuning. However, their direct application to Multi‑Object Tracking and Segmentation (MOTS) remains limited by the absence of explicit identity management mechanisms and by growing memory requirements during tracking. This work introduces Seg2Track‑SAM2, a framework that integrates pretrained object detectors with SAM2 and a dedicated Seg2Track module to support track initialization, data association, and track refinement. The method operates without dataset‑specific fine‑tuning and remains detector‑agnostic. Experimental evaluation on the KITTI MOTS and MOTS Challenge benchmarks shows that Seg2Track‑SAM2 ranks fourth overall in both datasets while achieving the highest association accuracy (AssA) among compared methods. In addition, a sliding‑window memory strategy reduces memory usage by up to 75% with minimal impact on tracking performance, enabling deployment under resource constraints. Together, these results indicate that Seg2Track‑SAM2 improves identity consistency and memory efficiency in MOTS without requiring dataset‑specific training. The code is available at https://github.com/hcmr‑lab/Seg2Track‑SAM2.
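
The memory saving from a sliding window can be illustrated with a bounded buffer. This is a generic sketch, not the Seg2Track‑SAM2 implementation or SAM2's memory API; the class name, window size, and stored payload are assumptions.

```python
from collections import deque


class SlidingWindowMemory:
    """Keep only the most recent `window_size` frame entries."""

    def __init__(self, window_size):
        # A deque with maxlen evicts the oldest entry automatically on append.
        self.frames = deque(maxlen=window_size)

    def add(self, frame_idx, features):
        self.frames.append((frame_idx, features))

    def frame_indices(self):
        return [idx for idx, _ in self.frames]


memory = SlidingWindowMemory(window_size=4)
for t in range(10):
    memory.add(t, features=f"feat_{t}")  # placeholder for per-frame features

print(memory.frame_indices())  # [6, 7, 8, 9]
```

Memory then stays constant in the sequence length rather than growing with every tracked frame, at the cost of forgetting appearance cues older than the window.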

Abstract:
LiDAR point cloud semantic segmentation is essential for interpreting 3D environments in applications such as autonomous driving and robotics. Recent methods achieve strong performance by exploiting different point cloud representations or incorporating data from other sensors, such as cameras or external datasets. However, these approaches often suffer from high computational complexity and require large amounts of training data, limiting their generalization in data‑scarce scenarios. In this paper, we improve the performance of point‑based methods by effectively learning features from 2D representations through point‑plane projections, enabling the extraction of complementary information while relying solely on LiDAR data. Additionally, we introduce a geometry‑aware technique for data augmentation that aligns with LiDAR sensor properties and mitigates class imbalance. We implemented and evaluated our method, which applies point‑plane projections onto multiple informative 2D representations of the point cloud. Experiments demonstrate that this approach leads to significant improvements in limited‑data scenarios, while also achieving competitive results on two publicly available standard datasets, namely SemanticKITTI and PandaSet. The code of our method is available at https://github.com/SiMoM0/3PNet

Abstract:
Graph neural networks have recently shown strong promise for event reconstruction tasks in Liquid Argon Time Projection Chambers, yet their performance remains limited for underrepresented classes of particles, such as Michel electrons. In this work, we investigate physics‑informed strategies to improve semantic segmentation within the NuGraph2 architecture. We explore three complementary approaches: (i) enriching the input representation with context‑aware features derived from detector geometry and track continuity, (ii) introducing auxiliary decoders to capture class‑level correlations, and (iii) incorporating energy‑based regularization terms motivated by Michel electron energy distributions. Experiments on MicroBooNE public datasets show that physics‑inspired feature augmentation yields the largest gains, particularly boosting Michel electron precision and recall by disentangling overlapping latent space regions. In contrast, auxiliary decoders and energy‑regularization terms provided limited improvements, partly due to the hit‑level nature of NuGraph2, which lacks explicit particle‑ or event‑level representations. Our findings highlight that embedding physics context directly into node‑level inputs is more effective than imposing task‑specific auxiliary losses, and suggest that future hierarchical architectures such as NuGraph3, with explicit particle‑ and event‑level reasoning, will provide a more natural setting for advanced decoders and physics‑based regularization. The code for this work is publicly available on GitHub at https://github.com/vitorgrizzi/nugraph_phys/tree/main_phys.

Abstract:
Diminished reality (DR) refers to the digital removal of real‑world objects by compositing background content in their place. This thesis presents a real‑time, inpainting‑based DR system designed to enable privacy control in shared‑space mixed reality (MR) meetings. The system allows a primary headset user to selectively remove personal or sensitive items from their environment, ensuring that those objects are no longer visible to other participants. Removal is achieved through semantic segmentation and precise object selection, followed by real‑time inpainting from the viewpoint of a secondary observer, implemented using a mobile ZED 2i depth camera. The solution is designed to be portable and robust, requiring neither a fixed secondary viewpoint nor prior 3D scanning of the environment. The system utilises YOLOv11 for object detection and a modified Decoupled Spatial‑Temporal Transformer (DSTT) model for high‑quality video inpainting. At 720p resolution, the pipeline sustains frame rates exceeding 20 fps, demonstrating the feasibility of real‑time diminished reality for practical privacy‑preserving MR applications.

Abstract:
Semantic segmentation, a key task in computer vision with broad applications in autonomous driving, medical imaging, and robotics, has advanced substantially with deep learning. Nevertheless, current approaches remain vulnerable to challenging conditions such as poor lighting, occlusions, and adverse weather. To address these limitations, multimodal methods that integrate auxiliary sensor data (e.g., LiDAR, infrared) have recently emerged, providing complementary information that enhances robustness. In this work, we present MM SAM‑adapter, a novel framework that extends the capabilities of the Segment Anything Model (SAM) for multimodal semantic segmentation. The proposed method employs an adapter network that injects fused multimodal features into SAM's rich RGB features. This design enables the model to retain the strong generalization ability of RGB features while selectively incorporating auxiliary modalities only when they contribute additional cues. As a result, MM SAM‑adapter achieves a balanced and efficient use of multimodal information. We evaluate our approach on three challenging benchmarks, DeLiVER, FMB, and MUSES, where MM SAM‑adapter delivers state‑of‑the‑art performance. To further analyze modality contributions, we partition DeLiVER and FMB into RGB‑easy and RGB‑hard subsets. Results consistently demonstrate that our framework outperforms competing methods in both favorable and adverse conditions, highlighting the effectiveness of multimodal adaptation for robust scene understanding. The code is available at the following link: https://github.com/iacopo97/Multimodal‑SAM‑Adapter.

Abstract:
RGB‑thermal (RGB‑T) semantic segmentation improves the environmental perception of autonomous platforms in challenging conditions. Prevailing models employ encoders pre‑trained on RGB images to extract features from both RGB and infrared inputs, and design additional modules to achieve cross‑modal feature fusion. This results in limited thermal feature extraction and suboptimal cross‑modal fusion, while the redundant encoders further compromise the model's real‑time efficiency. To address the above issues, we propose TUNI, with an RGB‑T encoder consisting of multiple stacked blocks that simultaneously perform multi‑modal feature extraction and cross‑modal fusion. By leveraging large‑scale pre‑training with RGB and pseudo‑thermal data, the RGB‑T encoder learns to integrate feature extraction and fusion in a unified manner. By slimming down the thermal branch, the encoder achieves a more compact architecture. Moreover, we introduce an RGB‑T local module to strengthen the encoder's capacity for cross‑modal local feature fusion. The RGB‑T local module employs adaptive cosine similarity to selectively emphasize salient consistent and distinct local features across RGB‑T modalities. Experimental results show that TUNI achieves competitive performance with state‑of‑the‑art models on FMB, PST900 and CART, with fewer parameters and lower computational cost. Meanwhile, it achieves an inference speed of 27 FPS on a Jetson Orin NX, demonstrating its real‑time capability in deployment. Codes are available at https://github.com/xiaodonguo/TUNI.

Abstract:
Robust semantic perception for autonomous vehicles relies on effectively combining multiple sensors with complementary strengths and weaknesses. State‑of‑the‑art sensor fusion approaches to semantic perception often treat sensor data uniformly across the spatial extent of the input, which hinders performance when faced with challenging conditions. By contrast, we propose a novel depth‑guided multimodal fusion method that upgrades condition‑aware fusion by integrating depth information. Our network, DGFusion, poses multimodal segmentation as a multi‑task problem, utilizing the lidar measurements, which are typically available in outdoor sensor suites, both as one of the model's inputs and as ground truth for learning depth. Our corresponding auxiliary depth head helps to learn depth‑aware features, which are encoded into spatially varying local depth tokens that condition our attentive cross‑modal fusion. Together with a global condition token, these local depth tokens dynamically adapt sensor fusion to the spatially varying reliability of each sensor across the scene, which largely depends on depth. In addition, we propose a robust loss for our depth, which is essential for learning from lidar inputs that are typically sparse and noisy in adverse conditions. Our method achieves state‑of‑the‑art panoptic and semantic segmentation performance on the challenging MUSES and DeLiVER datasets. Code and models are available at https://github.com/timbroed/DGFusion

Abstract:
Recently, learned image compression has attracted considerable attention due to its superior performance over traditional methods. However, most existing approaches employ a single entropy model to estimate the probability distribution of pixel values across the entire image, which limits their ability to capture the diverse statistical characteristics of different semantic regions. To overcome this limitation, we propose Segmentation‑Assisted Multi‑Entropy Models for Lossless Image Compression (SEEC). Our framework utilizes semantic segmentation to guide the selection and adaptation of multiple entropy models, enabling more accurate probability distribution estimation for distinct semantic regions. Experimental results on benchmark datasets demonstrate that SEEC achieves state‑of‑the‑art compression ratios while introducing only minimal encoding and decoding latency. Beyond its compression performance, the proposed model also supports Region of Interest (ROI) coding conditioned on the provided segmentation mask. Our code is available at https://github.com/chunbaobao/SEEC.
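
The intuition behind region-specific entropy models can be illustrated with a toy calculation. The sketch below is not the SEEC method: it only compares the ideal (empirical Shannon) code length of a pixel sequence under one shared histogram with the code length when each segmentation label gets its own histogram. The function names and the toy data are hypothetical, and the cost of transmitting the mask and the models is ignored.

```python
import math
from collections import Counter


def entropy_bits(symbols):
    """Empirical Shannon entropy of a sequence, in bits per symbol."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def total_bits_single(pixels):
    """Ideal code length when one histogram models the whole image."""
    return len(pixels) * entropy_bits(pixels)


def total_bits_per_region(pixels, mask):
    """Ideal code length when each segmentation label has its own histogram."""
    regions = {}
    for value, label in zip(pixels, mask):
        regions.setdefault(label, []).append(value)
    return sum(len(r) * entropy_bits(r) for r in regions.values())


# Toy image: two flat regions with different intensities.
pixels = [0, 0, 0, 0, 255, 255, 255, 255]
mask = [0, 0, 0, 0, 1, 1, 1, 1]

single = total_bits_single(pixels)
regional = total_bits_per_region(pixels, mask)
print(f"single model: {single:.1f} bits, per-region models: {regional:.1f} bits")
```

Because conditioning on the label cannot increase empirical entropy, the per-region total never exceeds the single-model total in this idealized setting; a real codec must also pay for side information.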

Abstract:
Segmenting 3D assets into their constituent parts is crucial for enhancing 3D understanding, facilitating model reuse, and supporting various applications such as part generation. However, current methods face limitations such as poor robustness when dealing with complex objects and cannot fully automate the process. In this paper, we propose a native 3D point‑promptable part segmentation model termed P^3‑SAM, designed to fully automate the segmentation of any 3D objects into components. Inspired by SAM, P^3‑SAM consists of a feature extractor, multiple segmentation heads, and an IoU predictor, enabling interactive segmentation for users. We also propose an algorithm to automatically select and merge masks predicted by our model for part instance segmentation. Our model is trained on a newly built dataset containing nearly 3.7 million models with reasonable segmentation labels. Comparisons show that our method achieves precise segmentation results and strong robustness on any complex objects, attaining state‑of‑the‑art performance. Our project page is available at https://murcherful.github.io/P3‑SAM/.

Abstract:
Histopathology image analysis is critical yet challenged by the need to segment tissue regions and nuclei instances for tumor microenvironment and cellular morphology analysis. Existing studies have focused on tissue semantic segmentation or nuclei instance segmentation separately, ignoring the inherent relationship between these two tasks and resulting in insufficient histopathology understanding. To address this issue, we propose a Co‑Seg framework for collaborative tissue and nuclei segmentation. Specifically, we introduce a novel co‑segmentation paradigm, allowing tissue and nuclei segmentation tasks to mutually enhance each other. To this end, we first devise a region‑aware prompt encoder (RP‑Encoder) to provide high‑quality semantic and instance region prompts as prior constraints. Moreover, we design a mutual prompt mask decoder (MP‑Decoder) that leverages cross‑guidance to strengthen the contextual consistency of both tasks, collaboratively computing semantic and instance segmentation masks. Extensive experiments on the PUMA dataset demonstrate that the proposed Co‑Seg surpasses state‑of‑the‑art methods in the semantic, instance and panoptic segmentation of tumor tissues and nuclei instances. The source code is available at https://github.com/xq141839/Co‑Seg.

Abstract:
The precise characterization of plant morphology provides valuable insights into plant environment interactions and genetic evolution. A key technology for extracting this information is 3D segmentation, which delineates individual plant organs from complex point clouds. Despite significant progress in general 3D computer vision domains, the adoption of 3D segmentation for plant phenotyping remains limited by three major challenges: i) the scarcity of large‑scale annotated datasets, ii) technical difficulties in adapting advanced deep neural networks to plant point clouds, and iii) the lack of standardized benchmarks and evaluation protocols tailored to plant science. This review systematically addresses these barriers by: i) providing an overview of existing 3D plant datasets in the context of general 3D segmentation domains, ii) systematically summarizing deep learning‑based methods for point cloud semantic and instance segmentation, iii) introducing Plant Segmentation Studio (PSS), an open‑source framework for reproducible benchmarking, and iv) conducting extensive quantitative experiments to evaluate representative networks and sim‑to‑real learning strategies. Our findings highlight the efficacy of sparse convolutional backbones and transformer‑based instance segmentation, while also emphasizing the complementary role of modeling‑based and augmentation‑based synthetic data generation for sim‑to‑real learning in reducing annotation demands. In general, this study bridges the gap between algorithmic advances and practical deployment, providing immediate tools for researchers and a roadmap for developing data‑efficient and generalizable deep learning solutions in 3D plant phenotyping. Data and code are available at https://github.com/perrydoremi/PlantSegStudio.

Abstract:
Accurate 3D instance segmentation is crucial for high‑quality scene understanding in the 3D vision domain. However, 3D instance segmentation based on 2D‑to‑3D lifting approaches struggles to produce precise instance‑level segmentation, due to accumulated errors introduced during the lifting process from ambiguous semantic guidance and insufficient depth constraints. To tackle these challenges, we propose splitting and growing reliable semantic mask for high‑fidelity 3D instance segmentation (SGS‑3D), a novel "split‑then‑grow" framework that first purifies and splits ambiguous lifted masks using geometric primitives, and then grows them into complete instances within the scene. Unlike existing approaches that directly rely on raw lifted masks and sacrifice segmentation accuracy, SGS‑3D serves as a training‑free refinement method that jointly fuses semantic and geometric information, enabling effective cooperation between the two levels of representation. Specifically, for semantic guidance, we introduce a mask filtering strategy that leverages the co‑occurrence of 3D geometry primitives to identify and remove ambiguous masks, thereby ensuring more reliable semantic consistency with the 3D object instances. For the geometric refinement, we construct fine‑grained object instances by exploiting both spatial continuity and high‑level features, particularly in the case of semantic ambiguity between distinct objects. Experimental results on ScanNet200, ScanNet++, and KITTI‑360 demonstrate that SGS‑3D substantially improves segmentation accuracy and robustness against inaccurate masks from pre‑trained models, yielding high‑fidelity object instances while maintaining strong generalization across diverse indoor and outdoor environments. Code is available at https://github.com/wangchaolei7/SGS‑3D.

Abstract:
Fast and accurate object perception in low‑light traffic scenes has attracted increasing attention. However, due to severe illumination degradation and the lack of reliable visual cues, existing perception models and methods struggle to quickly adapt to and accurately predict in low‑light environments. Moreover, no large‑scale benchmark specifically focused on low‑light traffic scenes is available. To bridge this gap, we introduce a physically grounded illumination degradation method tailored to real‑world low‑light settings and construct Dark‑traffic, the largest densely annotated dataset to date for low‑light traffic scenes, supporting object detection, instance segmentation, and optical flow estimation. We further propose the Separable Learning Vision Model (SLVM), a biologically inspired framework designed to enhance perception under adverse lighting. SLVM integrates four key components: a light‑adaptive pupillary mechanism for illumination‑sensitive feature extraction, a feature‑level separable learning strategy for efficient representation, task‑specific decoupled branches for multi‑task separable learning, and a spatial misalignment‑aware fusion module for precise multi‑feature alignment. Extensive experiments demonstrate that SLVM achieves state‑of‑the‑art performance with reduced computational overhead. Notably, it outperforms RT‑DETR by 11.2 percentage points in detection and YOLOv12 by 6.1 percentage points in instance segmentation, and reduces the endpoint error (EPE) of the baseline by 12.37% on Dark‑traffic. On the LIS benchmark, the end‑to‑end trained SLVM surpasses Swin Transformer+EnlightenGAN and ConvNeXt‑T+EnlightenGAN by an average of 11 percentage points across key metrics, and exceeds Mask RCNN (with light enhancement) by 3.1 percentage points. The Dark‑traffic dataset and complete code are released at https://github.com/alanli1997/slvm.

Abstract:
Estimating accurate and well‑calibrated predictive uncertainty is important for enhancing the reliability of computer vision models, especially in safety‑critical applications like traffic scene perception. While ensemble methods are commonly used to quantify uncertainty by combining multiple models, a mixture of experts (MoE) offers an efficient alternative by leveraging a gating network to dynamically weight expert predictions based on the input. Building on the promising use of MoEs for semantic segmentation in our previous works, we show that well‑calibrated predictive uncertainty estimates can be extracted from MoEs without architectural modifications. We investigate three methods to extract predictive uncertainty estimates: predictive entropy, mutual information, and expert variance. We evaluate these methods for an MoE with two experts trained on a semantic split of the A2D2 dataset. Our results show that MoEs yield more reliable uncertainty estimates than ensembles in terms of conditional correctness metrics under out‑of‑distribution (OOD) data. Additionally, we evaluate routing uncertainty computed via gate entropy and find that simple gating mechanisms lead to better calibration of routing uncertainty estimates than more complex classwise gates. Finally, our experiments on the Cityscapes dataset suggest that increasing the number of experts can further enhance uncertainty calibration. Our code is available at https://github.com/KASTEL‑MobilityLab/mixtures‑of‑experts/.
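
The uncertainty measures named in this abstract have standard textbook forms, sketched below for a single pixel. This is a minimal illustration under assumed definitions, not the authors' implementation: `moe_uncertainty` is a hypothetical helper, the gate weights are assumed to sum to 1, and expert variance is taken here as the gate-weighted squared deviation from the mixture, averaged over classes.

```python
import math


def entropy(p):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(x * math.log(x) for x in p if x > 0)


def moe_uncertainty(expert_probs, gate_weights):
    """Uncertainty measures for one pixel.

    expert_probs: one class distribution per expert.
    gate_weights: non-negative gate outputs, assumed to sum to 1.
    """
    num_classes = len(expert_probs[0])
    # Gate-weighted mixture of the expert predictions.
    mixture = [
        sum(w * p[c] for w, p in zip(gate_weights, expert_probs))
        for c in range(num_classes)
    ]
    predictive_entropy = entropy(mixture)
    expected_entropy = sum(w * entropy(p) for w, p in zip(gate_weights, expert_probs))
    # Mutual information: the part of the entropy caused by expert disagreement.
    mutual_information = predictive_entropy - expected_entropy
    expert_variance = sum(
        w * sum((p[c] - mixture[c]) ** 2 for c in range(num_classes))
        for w, p in zip(gate_weights, expert_probs)
    ) / num_classes
    return {
        "predictive_entropy": predictive_entropy,
        "mutual_information": mutual_information,
        "expert_variance": expert_variance,
        "gate_entropy": entropy(gate_weights),  # routing uncertainty
    }


agree = moe_uncertainty([[0.9, 0.1], [0.9, 0.1]], [0.5, 0.5])
disagree = moe_uncertainty([[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5])
print(agree["mutual_information"], disagree["mutual_information"])
```

When the experts agree, mutual information and expert variance vanish while predictive entropy can still be positive; when they disagree, all three grow.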

Abstract:
Recent advances in Vision Transformers (ViTs) and State Space Models (SSMs) have challenged the dominance of Convolutional Neural Networks (CNNs) in computer vision. ViTs excel at capturing global context, and SSMs like Mamba offer linear complexity for long sequences, yet they do not capture fine‑grained local features as effectively as CNNs. Conversely, CNNs possess strong inductive biases for local features but lack the global reasoning capabilities of transformers and Mamba. To bridge this gap, we introduce VCMamba, a novel vision backbone that integrates the strengths of CNNs and multi‑directional Mamba SSMs. VCMamba employs a convolutional stem and a hierarchical structure with convolutional blocks in its early stages to extract rich local features. These convolutional blocks are then processed by later stages incorporating multi‑directional Mamba blocks designed to efficiently model long‑range dependencies and global context. This hybrid design allows for superior feature representation while maintaining linear complexity with respect to image resolution. We demonstrate VCMamba's effectiveness through extensive experiments on ImageNet‑1K classification and ADE20K semantic segmentation. Our VCMamba‑B achieves 82.6% top‑1 accuracy on ImageNet‑1K, surpassing PlainMamba‑L3 by 0.3% with 37% fewer parameters, and outperforming Vision GNN‑B by 0.3% with 64% fewer parameters. Furthermore, VCMamba‑B obtains 47.1 mIoU on ADE20K, exceeding EfficientFormer‑L7 by 2.0 mIoU while utilizing 62% fewer parameters. Code is available at https://github.com/Wertyuui345/VCMamba.

Abstract:
AI models rely on annotated data to learn patterns and perform prediction. Annotation is usually a labor‑intensive step that requires associating labels with images, ranging from simple classification labels to more complex tasks such as object detection, oriented bounding box estimation, and instance segmentation. Traditional tools often require extensive manual input, limiting scalability for large datasets. To address this, we introduce VisioFirm, an open‑source web application designed to streamline image labeling through AI‑assisted automation. VisioFirm integrates state‑of‑the‑art foundation models into an interface with a filtering pipeline to reduce human‑in‑the‑loop efforts. This hybrid approach employs CLIP combined with pre‑trained detectors like Ultralytics models for common classes and zero‑shot models such as Grounding DINO for custom labels, generating initial annotations with low‑confidence thresholding to maximize recall. Through this framework, when tested on COCO‑type classes, initial predictions have been shown to be mostly correct, and users can refine them via interactive tools supporting bounding boxes, oriented bounding boxes, and polygons. Additionally, VisioFirm has on‑the‑fly segmentation powered by Segment Anything, accelerated through WebGPU for browser‑side efficiency. The tool supports multiple export formats (YOLO, COCO, Pascal VOC, CSV) and operates offline after model caching, enhancing accessibility. VisioFirm demonstrates up to 90% reduction in manual effort through benchmarks on diverse datasets, while maintaining high annotation accuracy via clustering of connected CLIP‑based disambiguated components and an IoU graph for redundant detection suppression. VisioFirm can be accessed from https://github.com/OschAI/VisioFirm.
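
One plausible reading of IoU-graph suppression is sketched below; the abstract does not specify the actual procedure, so the details are assumptions. Boxes are treated as graph nodes, an edge links two boxes whose IoU exceeds a threshold, and only the highest-scoring box in each connected component is kept. The names `iou` and `suppress_redundant` and the `(x1, y1, x2, y2)` box layout are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def suppress_redundant(boxes, scores, iou_threshold=0.5):
    """Return indices of the best-scoring box in each IoU-connected component."""
    n = len(boxes)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair of boxes that overlap strongly.
    for i in range(n):
        for j in range(i + 1, n):
            if iou(boxes[i], boxes[j]) > iou_threshold:
                parent[find(i)] = find(j)

    best = {}
    for i in range(n):
        root = find(i)
        if root not in best or scores[i] > scores[best[root]]:
            best[root] = i
    return sorted(best.values())


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = suppress_redundant(boxes, scores)
print(kept)
```

Unlike greedy non-maximum suppression, the component view merges chains of overlapping detections even when the two ends of the chain do not overlap each other directly.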

Abstract:
Point clouds are widely used for infrastructure monitoring by providing geometric information, where segmentation is required for downstream tasks such as defect detection. Existing research has automated semantic segmentation of structural components, while brick‑level segmentation (identifying defects such as spalling and mortar loss) has been primarily conducted from RGB images. However, acquiring high‑resolution images is impractical in low‑light environments like masonry tunnels. Point clouds, though robust to dim lighting, are typically unstructured, sparse, and noisy, limiting fine‑grained segmentation. We present InfraDiffusion, a zero‑shot framework that projects masonry point clouds into depth maps using virtual cameras and restores them by adapting the Denoising Diffusion Null‑space Model (DDNM). Without task‑specific training, InfraDiffusion enhances visual clarity and geometric consistency of depth maps. Experiments on masonry bridge and tunnel point cloud datasets show significant improvements in brick‑level segmentation using the Segment Anything Model (SAM), underscoring its potential for automated inspection of masonry assets. Our code and data are available at https://github.com/Jingyixiong/InfraDiffusion‑official‑implement.

Abstract:
The rise of Artificial Intelligence as a Service (AIaaS) democratizes access to pre‑trained models via Application Programming Interfaces (APIs), but also raises a fundamental question: how can local models be effectively trained using black‑box models that do not expose their weights, training data, or logits, a constraint under which current domain adaptation paradigms are impractical? To address this challenge, we introduce the Black‑Box Distillation (B2D) setting, which enables local model adaptation under realistic constraints: (1) the API model is open‑vocabulary and trained on large‑scale general‑purpose data, and (2) access is limited to one‑hot predictions only. We identify that open‑vocabulary models exhibit significant sensitivity to input resolution, with different object classes being segmented optimally at different scales, a limitation termed the "curse of resolution". Our method, ATtention‑Guided sCaler (ATGC), addresses this challenge by leveraging DINOv2 attention maps to dynamically select optimal scales for black‑box model inference. ATGC scores the attention maps with entropy to identify informative scales for pseudo‑labelling, enabling effective distillation. Experiments demonstrate substantial improvements under black‑box supervision across multiple datasets while requiring only one‑hot API predictions. Our code is available at https://github.com/yasserben/ATGC.
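
Entropy-based scoring of attention maps can be sketched as follows. The abstract does not state how the entropy score is turned into a scale choice, so this sketch makes two loud assumptions: each attention map is flattened into a list of non-negative values, and the scale whose map has the lowest size-normalized entropy (the most concentrated attention) is treated as the most informative. `normalized_entropy` and `select_scale` are hypothetical names, not part of ATGC.

```python
import math


def normalized_entropy(attention):
    """Entropy of a flattened attention map, divided by its maximum log(n).

    The map is first normalized to sum to 1. The result lies in [0, 1], so
    maps with different numbers of patches stay comparable.
    """
    n = len(attention)
    if n < 2:
        return 0.0
    total = sum(attention)
    probs = [a / total for a in attention if a > 0]
    h = -sum(p * math.log(p) for p in probs)
    return h / math.log(n)


def select_scale(attention_by_scale):
    """Pick the scale with the lowest normalized entropy (an assumption)."""
    return min(
        attention_by_scale,
        key=lambda s: normalized_entropy(attention_by_scale[s]),
    )


# Toy maps: diffuse attention at scale 0.5, focused attention at scale 1.0.
attention_by_scale = {
    0.5: [0.25, 0.25, 0.25, 0.25],
    1.0: [0.97, 0.01, 0.01, 0.01],
}
chosen = select_scale(attention_by_scale)
print(chosen)
```

If the opposite convention applies (higher entropy preferred), only the `min` in `select_scale` needs to become `max`.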

Abstract:
Understanding objects in videos in terms of fine‑grained localization masks and detailed semantic properties is a fundamental task in video understanding. In this paper, we propose VoCap, a flexible video model that consumes a video and a prompt of various modalities (text, box or mask), and produces a spatio‑temporal masklet with a corresponding object‑centric caption. As such, our model simultaneously addresses the tasks of promptable video object segmentation, referring expression segmentation, and object captioning. Since obtaining data for this task is tedious and expensive, we propose to annotate an existing large‑scale segmentation dataset (SAV) with pseudo object captions. We do so by preprocessing videos with their ground‑truth masks to highlight the object of interest and feeding the result to a large Vision Language Model (VLM). For an unbiased evaluation, we collect manual annotations on the validation set. We call the resulting dataset SAV‑Caption. We train our VoCap model at scale on SAV‑Caption together with a mix of other image and video datasets. Our model yields state‑of‑the‑art results on referring expression video object segmentation, is competitive on semi‑supervised video object segmentation, and establishes a benchmark for video object captioning. Our dataset will be made available at https://github.com/google‑deepmind/vocap.

Abstract:
3D Visual Grounding (3DVG) aims to localize objects in 3D scenes using natural language descriptions. Although supervised methods achieve higher accuracy in constrained settings, zero‑shot 3DVG holds greater promise for real‑world applications since it eliminates scene‑specific training requirements. However, existing zero‑shot methods suffer from spatially limited reasoning due to their reliance on single‑view localization, as well as from contextual omissions and detail degradation. To address these issues, we propose SeqVLM, a novel zero‑shot 3DVG framework that leverages multi‑view real‑world scene images with spatial information for target object reasoning. Specifically, SeqVLM first generates 3D instance proposals via a 3D semantic segmentation network and refines them through semantic filtering, retaining only semantically relevant candidates. A proposal‑guided multi‑view projection strategy then projects these candidate proposals onto real scene image sequences, preserving spatial relationships and contextual details in the conversion from 3D point cloud to images. Furthermore, to mitigate VLM computational overload, we implement a dynamic scheduling mechanism that iteratively processes sequence‑query prompts, leveraging the VLM's cross‑modal reasoning capabilities to identify textually specified objects. Experiments on the ScanRefer and Nr3D benchmarks demonstrate state‑of‑the‑art performance, achieving Acc@0.25 scores of 55.6% and 53.2%, surpassing previous zero‑shot methods by 4.0% and 5.2%, respectively, and advancing 3DVG toward greater generalization and real‑world applicability. The code is available at https://github.com/JiawLin/SeqVLM.

Abstract:
In this case study, we present a data‑efficient point cloud segmentation pipeline and training framework for robust segmentation of unimproved roads and seven other classes. Our method employs a two‑stage training framework: first, a projection‑based convolutional neural network is pre‑trained on a mixture of public urban datasets and a small, curated in‑domain dataset; then, a lightweight prediction head is fine‑tuned exclusively on in‑domain data. Along the way, we explore the application of Point Prompt Training to batch normalization layers and the effects of Manifold Mixup as a regularizer within our pipeline. We also explore the effects of incorporating histogram‑normalized ambients to further boost performance. Using only 50 labeled point clouds from our target domain, we show that our proposed training approach improves mean Intersection‑over‑Union from 33.5% to 51.8% and the overall accuracy from 85.5% to 90.8%, when compared to naive training on the in‑domain data. Crucially, our results demonstrate that pre‑training across multiple datasets is key to improving generalization and enabling robust segmentation under limited in‑domain supervision. Overall, this study demonstrates a practical framework for robust 3D semantic segmentation in challenging, low‑data scenarios. Our code is available at: https://github.com/andrewyarovoi/MD‑FRNet.
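
For reference, the two reported metrics can be computed from a confusion matrix as in the sketch below. It uses the standard definitions (per-class IoU = TP / (TP + FP + FN), averaged over classes; overall accuracy = correct points / all points); the function name and the toy 2-class matrix are illustrative only and unrelated to the paper's data.

```python
def miou_and_accuracy(confusion):
    """Mean IoU and overall accuracy from a square confusion matrix.

    confusion[g][p] counts points with ground-truth class g predicted as p.
    Classes that never appear in ground truth or predictions are skipped.
    """
    num_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[c][c] for c in range(num_classes))

    ious = []
    for c in range(num_classes):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                               # missed points of class c
        fp = sum(confusion[g][c] for g in range(num_classes)) - tp  # wrongly labeled as c
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)

    miou = sum(ious) / len(ious) if ious else 0.0
    accuracy = correct / total if total > 0 else 0.0
    return miou, accuracy


# Toy example: rows are ground truth, columns are predictions.
confusion = [
    [5, 1],
    [2, 2],
]
miou, accuracy = miou_and_accuracy(confusion)
print(f"mIoU = {miou:.4f}, overall accuracy = {accuracy:.2f}")
```

The gap between the two numbers in the abstract (mIoU near 52%, accuracy near 91%) is typical when a few large classes dominate the point count, since overall accuracy weights points while mIoU weights classes.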

Abstract:
Video Instance Segmentation (VIS) faces significant annotation challenges due to its dual requirements of pixel‑level masks and temporal consistency labels. While recent unsupervised methods like VideoCutLER eliminate optical flow dependencies through synthetic data, they remain constrained by the synthetic‑to‑real domain gap. We present AutoQ‑VIS, a novel unsupervised framework that bridges this gap through quality‑guided self‑training. Our approach establishes a closed‑loop system between pseudo‑label generation and automatic quality assessment, enabling progressive adaptation from synthetic to real videos. Experiments demonstrate state‑of‑the‑art performance with 52.6 AP50 on YouTubeVIS‑2019 val set, surpassing the previous state‑of‑the‑art VideoCutLER by 4.4%, while requiring no human annotations. This demonstrates the viability of quality‑aware self‑training for unsupervised VIS. The source code of our method is available at https://github.com/wcbup/AutoQ‑VIS.

Abstract:
Semantic segmentation of remote sensing (RS) images is pivotal for comprehensive Earth observation, but the demand for interpreting new object categories, coupled with the high expense of manual annotation, poses significant challenges. Although open‑vocabulary semantic segmentation (OVSS) offers a promising solution, existing frameworks designed for natural images are insufficient for the unique complexities of RS data. They struggle with vast scale variations and fine‑grained details, and their adaptation often relies on extensive, costly annotations. To address this critical gap, this paper introduces SegEarth‑OV, the first framework for annotation‑free open‑vocabulary segmentation of RS images. Specifically, we propose SimFeatUp, a universal upsampler that robustly restores high‑resolution spatial details from coarse features, correcting distorted target shapes without any task‑specific post‑training. We also present a simple yet effective Global Bias Alleviation operation to subtract the inherent global context from patch features, significantly enhancing local semantic fidelity. These components empower SegEarth‑OV to effectively harness the rich semantics of pre‑trained VLMs, making OVSS possible in optical RS contexts. Furthermore, to extend the framework's universality to other challenging RS modalities like SAR images, where large‑scale VLMs are unavailable and expensive to create, we introduce AlignEarth, which is a distillation‑based strategy and can efficiently transfer semantic knowledge from an optical VLM encoder to an SAR encoder, bypassing the need to build SAR foundation models from scratch and enabling universal OVSS across diverse sensor types. Extensive experiments on both optical and SAR datasets validate that SegEarth‑OV can achieve dramatic improvements over the SOTA methods, establishing a robust foundation for annotation‑free and open‑world Earth observation.

Abstract:
Echocardiography plays a central role in cardiac imaging, offering dynamic views of the heart that are essential for diagnosis and monitoring. However, image quality can be significantly degraded by haze arising from multipath reverberations, particularly in difficult‑to‑image patients. In this work, we propose a semantic‑guided, diffusion‑based dehazing algorithm developed for the MICCAI Dehazing Echocardiography Challenge (DehazingEcho2025). Our method integrates a pixel‑wise noise model, derived from semantic segmentation of hazy inputs, into a diffusion posterior sampling framework guided by a generative prior trained on clean ultrasound data. Quantitative evaluation on the challenge dataset demonstrates strong performance across contrast and fidelity metrics. Code for the submitted algorithm is available at https://github.com/tristan‑deep/semantic‑diffusion‑echo‑dehazing.

Abstract:
In this report, we present our solution to the MLCAS 2025 GWFSS Challenge. This challenge hosts a semantic segmentation competition specific to wheat plants, which requires segmenting three wheat organs (head, leaf, and stem) plus a background class. In 2025, participating in a segmentation competition differs significantly from previous years, when many tricks could play important roles. Nowadays most segmentation tricks have been well integrated into existing codebases, such that our naive ViT‑Adapter baseline already achieves sufficiently good performance. Hence, we believe the key to standing out among other competitors is to focus on the nature of the wheat problem per se. By probing visualizations, we identify the key: the stem matters. In contrast to heads and leaves, stems exhibit fine structure and occupy only a few pixels, so they suffer from fragile predictions and class imbalance. Building on our baseline, we present three technical improvements tailored to stems: i) incorporating the dynamic upsampler SAPA to enhance detail delineation; ii) leveraging semi‑supervised guided distillation with stem‑aware sample selection to mine the treasure beneath unlabeled data; and iii) applying a test‑time scaling strategy to zoom in and segment the image twice. Despite being simple, the three improvements bring us to first place in the competition, outperforming the second place by clear margins. Code and models will be released at https://github.com/tiny‑smart/gwfss25.

Abstract:
3D mapping in dynamic environments poses a challenge for modern researchers in robotics and autonomous transportation. There are no universal representations for dynamic 3D scenes that incorporate multimodal data such as images, point clouds, and text. This article takes a step toward solving this problem. It proposes a taxonomy of methods for constructing multimodal 3D maps, classifying contemporary approaches based on scene types and representations, learning methods, and practical applications. Using this taxonomy, a brief structured analysis of recent methods is provided. The article also describes an original modular method called M3DMap, designed for object‑aware construction of multimodal 3D maps for both static and dynamic scenes. It consists of several interconnected components: a neural multimodal object segmentation and tracking module; an odometry estimation module, including trainable algorithms; a module for 3D map construction and updating with various implementations depending on the desired scene representation; and a multimodal data retrieval module. The article highlights original implementations of these modules and their advantages in solving various practical tasks, from 3D object grounding to mobile manipulation. Additionally, it presents theoretical propositions demonstrating the positive effect of using multimodal data and modern foundational models in 3D mapping methods. Details of the taxonomy and method implementation are available at https://yuddim.github.io/M3DMap.

Abstract:
Non‑destructive 3D imaging of large multi‑particulate samples is essential for quantifying particle‑level properties, such as size, shape, and spatial distribution, across applications in mining, materials science, and geology. However, accurate instance segmentation of particles in tomographic data remains challenging due to high morphological variability and frequent particle contact, which limit the effectiveness of classical methods like watershed algorithms. While supervised deep learning approaches offer improved performance, they rely on extensive annotated datasets that are labor‑intensive, error‑prone, and difficult to scale. In this work, we propose self‑validated learning, a novel self‑training framework for particle instance segmentation that eliminates the need for manual annotations. Our method leverages implicit boundary detection and iteratively refines the training set by identifying particles that can be consistently matched across reshuffled scans of the same sample. This self‑validation mechanism mitigates the impact of noisy pseudo‑labels, enabling robust learning from unlabeled data. After just three iterations, our approach accurately segments over 97% of the total particle volume and identifies more than 54,000 individual particles in tomographic scans of quartz fragments. Importantly, the framework also enables fully autonomous model evaluation without the need for ground truth annotations, as confirmed through comparisons with state‑of‑the‑art instance segmentation techniques. The method is integrated into the Biomedisa image analysis platform (https://github.com/biomedisa/biomedisa/).

Abstract:
Predicting and tracking objects in real‑world scenarios is a critical challenge in Video Object Segmentation (VOS) tasks. Unsupervised VOS (UVOS) has the additional challenge of finding an initial segmentation of salient objects, which affects the entire process and leaves a permanent uncertainty about the object proposals. Moreover, deformation and fast motion can lead to temporal inconsistencies. To address these problems, we propose Frequent Temporally Integrated Objects (FTIO), a post‑processing framework with two key components. First, we introduce a combined criterion to improve object selection, mitigating failures common in UVOS (particularly when objects are small or structurally complex) by extracting frequently appearing salient objects. Second, we present a three‑stage method to correct temporal inconsistencies by integrating missing object mask regions. Experimental results demonstrate that FTIO achieves state‑of‑the‑art performance in multi‑object UVOS. Code is available at: https://github.com/MohammadMohammadzadehKalati/FTIO

Abstract:
Meta‑learning aims to uniformly sample homogeneous support‑query pairs, characterized by the same categories and similar attributes, and extract useful inductive biases through identical network architectures. However, this identical network design results in over‑semantic homogenization. To address this, we propose a novel homologous but heterogeneous network. By treating support‑query pairs as dual perspectives, we introduce heterogeneous visual aggregation (HA) modules to enhance complementarity while preserving semantic commonality. To further reduce semantic noise and amplify the uniqueness of heterogeneous semantics, we design a heterogeneous transfer (HT) module. Finally, we propose heterogeneous CLIP (HC) textual information to enhance the generalization capability of multimodal models. In the weakly‑supervised few‑shot semantic segmentation (WFSS) task, with only 1/24 of the parameters of existing state‑of‑the‑art models, TLG achieves a 13.2% improvement on Pascal‑5^i and a 9.7% improvement on COCO‑20^i. To the best of our knowledge, TLG is also the first weakly supervised (image‑level) model that outperforms fully supervised (pixel‑level) models under the same backbone architectures. The code is available at https://github.com/jarch‑ma/TLG.

Abstract:
Accurate morphological quantification of renal pathology functional units relies on instance‑level segmentation, yet most existing datasets and automated methods provide only binary (semantic) masks, limiting the precision of downstream analyses. Although classical post‑processing techniques, such as watershed, morphological operations, and skeletonization, are often used to separate semantic masks into instances, their individual effectiveness is constrained by the diverse morphologies and complex connectivity found in renal tissue. In this study, we present DyMorph‑B2I, a dynamic, morphology‑guided binary‑to‑instance segmentation pipeline tailored for renal pathology. Our approach integrates watershed, skeletonization, and morphological operations within a unified framework, complemented by adaptive geometric refinement and customizable hyperparameter tuning for each class of functional unit. Through systematic parameter optimization, DyMorph‑B2I robustly separates adherent and heterogeneous structures present in binary masks. Experimental results demonstrate that our method outperforms individual classical approaches and naïve combinations, enabling superior instance separation and facilitating more accurate morphometric analysis in renal pathology workflows. The pipeline is publicly available at: https://github.com/ddrrnn123/DyMorph‑B2I.
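
As a point of reference for the binary‑to‑instance step described above, the simplest baseline assigns one instance label to each connected foreground region. The sketch below is not the DyMorph‑B2I pipeline: it is a plain 4‑connected flood fill, and the function name `label_instances` and the toy mask are hypothetical. It also shows the limitation the abstract refers to, since two adherent objects that touch would receive a single label.

```python
from collections import deque


def label_instances(mask):
    """Label 4-connected foreground regions of a binary mask (list of rows of 0/1).

    Returns (labels, count), where labels has the same shape as mask,
    0 marks background, and 1..count mark the individual regions.
    """
    rows, cols = len(mask), len(mask[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and labels[r][c] == 0:
                # Found an unlabeled foreground pixel: start a new region.
                count += 1
                labels[r][c] = count
                queue = deque([(r, c)])
                while queue:
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        inside = 0 <= ny < rows and 0 <= nx < cols
                        if inside and mask[ny][nx] and labels[ny][nx] == 0:
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels, count


# Toy mask with three separate regions (hypothetical data).
mask = [
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
]
labels, count = label_instances(mask)
print(count)
```

If the two upper regions were joined by a single bridging pixel, this baseline would merge them into one instance, which is the situation where watershed or skeleton‑based splitting becomes necessary.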

Abstract:
We introduce RynnEC, a video multimodal large language model designed for embodied cognition. Built upon a general‑purpose vision‑language foundation model, RynnEC incorporates a region encoder and a mask decoder, enabling flexible region‑level video interaction. Despite its compact architecture, RynnEC achieves state‑of‑the‑art performance in object property understanding, object segmentation, and spatial reasoning. Conceptually, it offers a region‑centric video paradigm for the brain of embodied agents, providing fine‑grained perception of the physical world and enabling more precise interactions. To mitigate the scarcity of annotated 3D datasets, we propose an egocentric video based pipeline for generating embodied cognition data. Furthermore, we introduce RynnEC‑Bench, a region‑centered benchmark for evaluating embodied cognitive capabilities. We anticipate that RynnEC will advance the development of general‑purpose cognitive cores for embodied agents and facilitate generalization across diverse embodied tasks. The code, model checkpoints, and benchmark are available at: https://github.com/alibaba‑damo‑academy/RynnEC

Abstract:
Novel view synthesis from images, for example, with 3D Gaussian splatting, has made great progress. Rendering fidelity and speed are now ready even for demanding virtual reality applications. However, the problem of assisting humans in collecting the input images for these rendering algorithms has received much less attention. High‑quality view synthesis requires uniform and dense view sampling. Unfortunately, these requirements are not easily addressed by human camera operators, who are in a hurry, impatient, or lack understanding of the scene structure and the photographic process. Existing approaches to guide humans during image acquisition concentrate on single objects or neglect view‑dependent material characteristics. We propose a novel situated visualization technique for scanning at multiple scales. During the scanning of a scene, our method identifies important objects that need extended image coverage to properly represent view‑dependent appearance. To this end, we leverage semantic segmentation and category identification, ranked by a vision‑language model. Spherical proxies are generated around highly ranked objects to guide the user during scanning. Our results show superior performance in real scenes compared to conventional view sampling strategies.

Abstract:
We present an overview of the Spatio‑temporal Instance Segmentation (SIS) challenge held in conjunction with the CVPR 2025 Event‑based Vision Workshop. The task is to predict accurate pixel‑level segmentation masks of defined object classes from spatio‑temporally aligned event camera and grayscale camera data. We provide an overview of the task, dataset, challenge details and results. Furthermore, we describe the methods used by the top‑5 ranking teams in the challenge. More resources and code of the participants' methods are available here: https://github.com/tub‑rip/MouseSIS/blob/main/docs/challenge_results.md

Abstract:
Labeling Cadmium Zinc Telluride (CdZnTe) semiconductor images is challenging due to the low‑contrast defect boundaries, necessitating annotators to cross‑reference multiple views. These views share a single ground truth (GT), forming a unique "many‑to‑one" relationship. This characteristic renders advanced semi‑supervised semantic segmentation (SSS) methods suboptimal, as they are generally limited by a "one‑to‑one" relationship, where each image is independently associated with its GT. Such a limitation may lead to error accumulation in low‑contrast regions, further exacerbating confirmation bias. To address this issue, we revisit the SSS pipeline from a group‑oriented perspective and propose a human‑inspired solution: the Intra‑group Consistency Augmentation Framework (ICAF). First, we experimentally validate the inherent consistency constraints within CdZnTe groups, establishing a group‑oriented baseline using the Intra‑group View Sampling (IVS). Building on this insight, we introduce the Pseudo‑label Correction Network (PCN) to enhance consistency representation, which consists of two key modules. The View Augmentation Module (VAM) improves boundary details by dynamically synthesizing a boundary‑aware view through the aggregation of multiple views. In the View Correction Module (VCM), this synthesized view is paired with other views for information interaction, effectively emphasizing salient regions while minimizing noise. Extensive experiments demonstrate the effectiveness of our solution for CdZnTe materials. Leveraging DeepLabV3+ with a ResNet‑101 backbone as our segmentation model, we achieve a 70.6% mIoU on the CdZnTe dataset using only 2 group‑annotated data (5‰). The code is available at https://github.com/pipixiapipi/ICAF.

Abstract:
Semi‑supervised semantic segmentation (S4) has advanced remote sensing (RS) analysis by leveraging unlabeled data through pseudo‑labeling and consistency learning. However, existing S4 studies often rely on small‑scale datasets and models, limiting their practical applicability. To address this, we propose S5, the first scalable framework for semi‑supervised semantic segmentation in RS, which unlocks the potential of vast unlabeled Earth observation data typically underutilized due to costly pixel‑level annotations. Built upon existing large‑scale RS datasets, S5 introduces a data selection strategy that integrates entropy‑based filtering and diversity expansion, resulting in the RS4P‑1M dataset. Using this dataset, we systematically scale up S4 into a new pretraining paradigm, S4 pre‑training (S4P), to pretrain RS foundation models (RSFMs) of varying sizes on this extensive corpus, significantly boosting their performance on land cover segmentation and object detection tasks. Furthermore, during fine‑tuning, we incorporate a Mixture‑of‑Experts (MoE)‑based multi‑dataset fine‑tuning approach, which enables efficient adaptation to multiple RS benchmarks with fewer parameters. This approach improves the generalization and versatility of RSFMs across diverse RS benchmarks. The resulting RSFMs achieve state‑of‑the‑art performance across all benchmarks, underscoring the viability of scaling semi‑supervised learning for RS applications. All datasets, code, and models will be released at https://github.com/MiliLab/S5

Abstract:
Edge detection serves as a critical foundation for numerous computer vision applications, including object detection, semantic segmentation, and image editing, by extracting essential structural cues that define object boundaries and salient edges. To be viable for broad deployment across devices with varying computational capacities, edge detectors should balance high accuracy with low computational complexity. While deep learning methods have clearly improved accuracy, they often suffer from high computational costs, limiting their applicability on resource‑constrained devices. This paper addresses the challenge of achieving that balance, i.e., how to efficiently capture discriminative features without relying on large and sophisticated models. We propose PEdger++, a collaborative learning framework designed to reduce computational costs and model sizes while improving edge detection accuracy. The core principle of PEdger++ is that cross‑information derived from heterogeneous architectures, diverse training moments, and multiple parameter samplings is beneficial for enhancing learning from an ensemble perspective. Extensive experimental results on the BSDS500, NYUD and Multicue datasets demonstrate the effectiveness of our approach, both quantitatively and qualitatively, showing clear improvements over existing methods. We also provide multiple versions of the model with varying computational requirements, highlighting PEdger++'s adaptability with respect to different resource constraints. Code is accessible at https://github.com/ForawardStar/EdgeDetectionviaPEdgerPlus/.

Abstract:
Referring Video Object Segmentation (RVOS) aims to segment and track objects in videos based on natural language expressions, requiring precise alignment between visual content and textual queries. However, existing methods often suffer from semantic misalignment, largely due to indiscriminate frame sampling and supervision of all visible objects during training, regardless of their actual relevance to the expression. We identify the core problem as the absence of an explicit temporal learning signal in conventional training paradigms. To address this, we introduce MeViS‑M, a dataset built upon the challenging MeViS benchmark, where we manually annotate the temporal spans during which each object is referred to by the expression. These annotations provide a direct, semantically grounded supervision signal that was previously missing. To leverage this signal, we propose Temporally Grounded Learning (TGL), a novel learning framework that directly incorporates temporal grounding into the training process. Within this framework, we introduce two key strategies. First, Moment‑guided Dual‑path Propagation (MDP) improves both grounding and tracking by decoupling language‑guided segmentation for relevant moments from language‑agnostic propagation for others. Second, Object‑level Selective Supervision (OSS) supervises only the objects temporally aligned with the expression in each training clip, thereby reducing semantic noise and reinforcing language‑conditioned learning. Extensive experiments demonstrate that our TGL framework effectively leverages this temporal signal to establish a new state of the art on the challenging MeViS benchmark. We will make our code and the MeViS‑M dataset publicly available.

Abstract:
Dense visual perception tasks have been constrained by their reliance on predefined categories, limiting their applicability in real‑world scenarios where visual concepts are unbounded. While Vision‑Language Models (VLMs) like CLIP have shown promise in open‑vocabulary tasks, their direct application to dense perception often leads to suboptimal performance due to limitations in local feature representation. In this work, we present our observation that CLIP's image tokens struggle to effectively aggregate information from spatially or semantically related regions, resulting in features that lack local discriminability and spatial consistency. To address this issue, we propose DeCLIP, a novel framework that enhances CLIP by decoupling the self‑attention module to obtain "content" and "context" features respectively. The context features are enhanced by jointly distilling semantic correlations from Vision Foundation Models (VFMs) and object integrity cues from diffusion models, thereby enhancing spatial consistency. In parallel, the content features are aligned with image crop representations and constrained by region correlations from VFMs to improve local discriminability. Extensive experiments demonstrate that DeCLIP establishes a solid foundation for open‑vocabulary dense perception, consistently achieving state‑of‑the‑art performance across a broad spectrum of tasks, including 2D detection and segmentation, 3D instance segmentation, video instance segmentation, and 6D object pose estimation. Code is available at https://github.com/xiaomoguhz/DeCLIP

Abstract:
Continual video instance segmentation demands both the plasticity to absorb new object categories and the stability to retain previously learned ones, all while preserving temporal consistency across frames. In this work, we introduce Contrastive Residual Injection and Semantic Prompting (CRISP), an early attempt tailored to address the instance‑wise, category‑wise, and task‑wise confusion in continual video instance segmentation. For instance‑wise learning, we model instance tracking and construct an instance correlation loss, which emphasizes the correlation with the prior query space while strengthening the specificity of the current task query. For category‑wise learning, we build an adaptive residual semantic prompt (ARSP) learning framework, which constructs a learnable semantic residual prompt pool generated from category text and uses an adjustive query‑prompt matching mechanism to build a mapping relationship between the query of the current task and the semantic residual prompt. Meanwhile, a semantic consistency loss based on contrastive learning is introduced to maintain semantic coherence between object queries and residual prompts during incremental training. For task‑wise learning, to ensure the correlation at the inter‑task level within the query space, we introduce a concise yet powerful initialization strategy for incremental prompts. Extensive experiments on the YouTube‑VIS‑2019 and YouTube‑VIS‑2021 datasets demonstrate that CRISP significantly outperforms existing continual segmentation methods in the long‑term continual video instance segmentation task, avoiding catastrophic forgetting and effectively improving segmentation and classification performance. The code is available at https://github.com/01upup10/CRISP.

Abstract:
Vision transformers (ViTs) have recently been widely applied to 3D point cloud understanding, with masked autoencoding as the predominant pre‑training paradigm. However, the challenge of learning dense and informative semantic features from point clouds via standard ViTs remains underexplored. We propose MaskClu, a novel unsupervised pre‑training method for ViTs on 3D point clouds that integrates masked point modeling with clustering‑based learning. MaskClu is designed to reconstruct both cluster assignments and cluster centers from masked point clouds, thus encouraging the model to capture dense semantic information. Additionally, we introduce a global contrastive learning mechanism that enhances instance‑level feature learning by contrasting different masked views of the same point cloud. By jointly optimizing these complementary objectives, i.e., dense semantic reconstruction and instance‑level contrastive learning, MaskClu enables ViTs to learn richer and more semantically meaningful representations from 3D point clouds. We validate the effectiveness of our method via multiple 3D tasks, including part segmentation, semantic segmentation, object detection, and classification, where MaskClu sets new competitive results. The code and models will be released at: https://github.com/Amazingren/maskclu.

Abstract:
Semantic segmentation is fundamental to vision systems requiring pixel‑level scene understanding, yet deploying it on resource‑constrained devices demands efficient architectures. Although existing methods achieve real‑time inference through lightweight designs, we reveal their inherent limitation: misalignment between class representations and image features caused by a per‑pixel classification paradigm. With experimental analysis, we find that this paradigm results in a highly challenging assumption for efficient scenarios: image pixel features should not vary for the same category in different images. To address this dilemma, we propose a coupled dual‑branch offset learning paradigm that explicitly learns feature and class offsets to dynamically refine both class representations and spatial image features. Based on the proposed paradigm, we construct an efficient semantic segmentation network, OffSeg. Notably, the offset learning paradigm can be adopted by existing methods with no additional architectural changes. Extensive experiments on four datasets, including ADE20K, Cityscapes, COCO‑Stuff‑164K, and Pascal Context, demonstrate consistent improvements with a negligible number of additional parameters. For instance, on the ADE20K dataset, our proposed offset learning paradigm improves SegFormer‑B0, SegNeXt‑T, and Mask2Former‑Tiny by 2.7%, 1.9%, and 2.6% mIoU, respectively, with only 0.1‑0.2M additional parameters required.

Abstract:
Video instance segmentation (VIS) has gained significant attention for its capability in tracking and segmenting object instances across video frames. However, most of the existing VIS approaches unrealistically assume that the categories of object instances remain fixed over time. Moreover, they experience catastrophic forgetting of old classes when required to continuously learn object instances belonging to new categories. To resolve these challenges, we develop a novel Hierarchical Visual Prompt Learning (HVPL) model that overcomes catastrophic forgetting of previous categories from both frame‑level and video‑level perspectives. Specifically, to mitigate forgetting at the frame level, we devise a task‑specific frame prompt and an orthogonal gradient correction (OGC) module. The OGC module helps the frame prompt encode task‑specific global instance information for new classes in each individual frame by projecting its gradients onto the orthogonal feature space of old classes. Furthermore, to address forgetting at the video level, we design a task‑specific video prompt and a video context decoder. This decoder first embeds structural inter‑class relationships across frames into the frame prompt features, and then propagates task‑specific global video contexts from the frame prompt features to the video prompt. Through rigorous comparisons, our HVPL model proves to be more effective than baseline approaches. The code is available at https://github.com/JiahuaDong/HVPL.
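
The idea of projecting gradients onto the orthogonal feature space of old classes can be illustrated with a generic orthogonal projection. This is a minimal sketch under stated assumptions, not the OGC module itself: it assumes the old‑class subspace is given as a list of orthonormal vectors, and the names `project_out`, `old_basis`, and `grad` are hypothetical. How HVPL actually constructs that subspace is not stated in the abstract.

```python
def dot(a, b):
    """Inner product of two equal-length vectors given as lists."""
    return sum(x * y for x, y in zip(a, b))


def project_out(grad, old_basis):
    """Remove from grad its components along each vector in old_basis.

    Assumes old_basis is orthonormal, so the result is orthogonal to
    every basis vector: g_perp = g - sum_i (g . u_i) u_i.
    """
    result = list(grad)
    for u in old_basis:
        coeff = dot(grad, u)
        result = [r - coeff * ui for r, ui in zip(result, u)]
    return result


# Hypothetical 3-D example: old classes span the first two axes.
old_basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
grad = [0.5, -2.0, 3.0]
corrected = project_out(grad, old_basis)
print(corrected)
```

Only the component along the third axis survives, so an update along `corrected` does not move the parameters in the directions assumed to encode old‑class information.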

Abstract:
Semantic segmentation of structural defects in civil infrastructure remains challenging due to variable defect appearances, harsh imaging conditions, and significant class imbalance. Current deep learning methods, despite their effectiveness, typically require millions of parameters, rendering them impractical for real‑time inspection systems. We introduce KARMA (Kolmogorov‑Arnold Representation Mapping Architecture), a highly efficient semantic segmentation framework that models complex defect patterns through compositions of one‑dimensional functions rather than conventional convolutions. KARMA features three technical innovations: (1) a parameter‑efficient Tiny Kolmogorov‑Arnold Network (TiKAN) module leveraging low‑rank factorization for KAN‑based feature transformation; (2) an optimized feature pyramid structure with separable convolutions for multi‑scale defect analysis; and (3) a static‑dynamic prototype mechanism that enhances feature representation for imbalanced classes. Extensive experiments on benchmark infrastructure inspection datasets demonstrate that KARMA achieves competitive or superior mean IoU performance compared to state‑of‑the‑art approaches, while using significantly fewer parameters (0.959M vs. 31.04M, a 97% reduction). Operating at 0.264 GFLOPS, KARMA maintains inference speeds suitable for real‑time deployment, enabling practical automated infrastructure inspection systems without compromising accuracy. The source code can be accessed at the following URL: https://github.com/faeyelab/karma.
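
The reported parameter reduction can be checked directly from the two figures in the abstract, and the general effect of low‑rank factorization on parameter counts can be shown with a small calculation. The layer sizes and rank below are hypothetical and are not taken from KARMA or TiKAN; the helper names are illustrative only.

```python
# Parameter counts quoted in the abstract, in millions.
karma_params_m = 0.959
baseline_params_m = 31.04

reduction = 1.0 - karma_params_m / baseline_params_m
print(f"parameter reduction: {reduction:.1%}")


def dense_params(d_in, d_out):
    """Weights in a dense d_in x d_out matrix (bias ignored)."""
    return d_in * d_out


def low_rank_params(d_in, d_out, rank):
    """Weights when the matrix is factored as (d_in x rank) @ (rank x d_out)."""
    return rank * (d_in + d_out)


# Hypothetical layer: 256 -> 256 features, factored with rank 8.
print(dense_params(256, 256), low_rank_params(256, 256, 8))
```

The first figure rounds to the 97% stated in the abstract. The second line shows why a small rank pays off: the factored form grows with the sum of the two dimensions rather than their product.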

Abstract:
Real‑time semantic segmentation presents the dual challenge of designing efficient architectures that capture large receptive fields for semantic understanding while also refining detailed contours. Vision transformers model long‑range dependencies effectively but incur high computational cost. To address these challenges, we introduce the Large Kernel Attention (LKA) mechanism. Our proposed Bilateral Efficient Visual Attention Network (BEVANet) expands the receptive field to capture contextual information and extracts visual and structural features using Sparse Decomposed Large Separable Kernel Attentions (SDLSKA). The Comprehensive Kernel Selection (CKS) mechanism dynamically adapts the receptive field to further enhance performance. Furthermore, the Deep Large Kernel Pyramid Pooling Module (DLKPPM) enriches contextual features by synergistically combining dilated convolutions and large kernel attention. The bilateral architecture facilitates frequent branch communication, and the Boundary Guided Adaptive Fusion (BGAF) module enhances boundary delineation by integrating spatial and semantic features under boundary guidance. BEVANet achieves real‑time segmentation at 33 FPS, yielding 79.3% mIoU without pretraining and 81.0% mIoU on Cityscapes after ImageNet pretraining, demonstrating state‑of‑the‑art performance. The code and model are available at https://github.com/maomao0819/BEVANet.
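
Several abstracts in this list report mIoU, so a small reference computation may help. The sketch below computes mean intersection‑over‑union on flattened label lists. It is a simplified illustration with hypothetical toy data: benchmark implementations typically accumulate intersections and unions over the whole dataset before averaging, and may handle ignore labels, neither of which is shown here.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in the prediction or the ground truth."""
    ious = []
    for cls in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else float("nan")


# Flattened per-pixel class labels for a tiny 6-pixel image (toy data).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
print(mean_iou(pred, target, num_classes=3))
```

Here the per‑class IoUs are 1/3, 2/3, and 1/2, which average to 0.5.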

Abstract:
Referring Video Object Segmentation (RVOS) aims to segment the object in a video that is referred to by an expression. Current RVOS methods view referring expressions as unstructured sequences, neglecting the semantic structure that is essential for referent reasoning. Besides, in contrast to image‑referring expressions, whose semantics focus only on object attributes and object‑object relations, video‑referring expressions also encompass event attributes and event‑event temporal relations. This complexity challenges traditional structured reasoning approaches designed for images. In this paper, we propose the Event Referential Reasoning (EventRR) framework. EventRR decouples RVOS into an object summarization part and a referent reasoning part. The summarization phase begins by summarizing each frame into a set of bottleneck tokens, which are then efficiently aggregated in the video‑level summarization step to exchange the global cross‑modal temporal context. For the reasoning part, EventRR extracts the semantic eventful structure of a video‑referring expression into a highly expressive Referential Event Graph (REG), which is a single‑rooted directed acyclic graph. Guided by the topological traversal of the REG, we propose Temporal Concept‑Role Reasoning (TCRR) to accumulate the referring score of each temporal query from the REG leaf nodes to the root node. Each reasoning step can be interpreted as a question‑answer pair derived from the concept‑role relations in the REG. Extensive experiments across four widely recognized benchmark datasets show that EventRR quantitatively and qualitatively outperforms state‑of‑the‑art RVOS methods. Code is available at https://github.com/bio‑mlhui/EventRR
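
The leaf‑to‑root accumulation over a single‑rooted directed acyclic graph can be sketched with a topological ordering. This is only an illustration of the traversal pattern, not TCRR: the graph, the node names, the per‑node scores, and the rule of multiplying a node's own score by its children's accumulated scores are all hypothetical placeholders. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def accumulate_to_root(children, local_score):
    """Accumulate scores from leaves to the root of a DAG.

    children maps each node to the list of its child nodes. Passing it to
    TopologicalSorter as the predecessor map makes static_order() yield
    every child before its parents, i.e. a leaf-to-root order.
    """
    total = {}
    for node in TopologicalSorter(children).static_order():
        score = local_score[node]
        for child in children.get(node, ()):
            score *= total[child]  # placeholder combination rule
        total[node] = score
    return total


# Hypothetical graph: one root, two event nodes sharing one object node.
children = {
    "root": ["event_a", "event_b"],
    "event_a": ["object"],
    "event_b": ["object"],
    "object": [],
}
local_score = {"root": 1.0, "event_a": 0.9, "event_b": 0.5, "object": 0.8}
total = accumulate_to_root(children, local_score)
print(total["root"])
```

Because the shared `object` node is processed once and reused by both event nodes, the traversal works for a DAG and not only for a tree.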

Abstract:
Recent self‑supervised image segmentation models have achieved promising performance on semantic segmentation and class‑agnostic instance segmentation. However, their pretraining schedule is multi‑stage, requiring a time‑consuming pseudo‑mask generation process between training epochs. This time‑consuming offline process not only makes it difficult to scale with training dataset size, but also leads to sub‑optimal solutions due to its discontinuous optimization routine. To address these issues, we first present a novel pseudo‑mask algorithm, Fast Universal Agglomerative Pooling (UniAP). Each layer of UniAP can identify groups of similar nodes in parallel, allowing it to generate semantic‑level, instance‑level, and multi‑granular pseudo‑masks within tens of milliseconds for one image. Based on the fast UniAP, we propose the Scalable Self‑Supervised Universal Segmentation (S2‑UniSeg), which employs a student and a momentum teacher for continuous pretraining. A novel segmentation‑oriented pretext task, Query‑wise Self‑Distillation (QuerySD), is proposed to pretrain S2‑UniSeg to learn the local‑to‑global correspondences. Under the same setting, S2‑UniSeg outperforms the SOTA UnSAM model, achieving notable improvements of AP+6.9 on COCO, AR+11.1 on UVO, PixelAcc+4.5 on COCOStuff‑27, and RQ+8.0 on Cityscapes. After scaling up to a larger 2M‑image subset of SA‑1B, S2‑UniSeg further achieves performance gains on all four benchmarks. Our code and pretrained models are available at https://github.com/bio‑mlhui/S2‑UniSeg

Abstract:
Promptable video object segmentation and tracking (VOST) has seen significant advances with the emergence of foundation models like Segment Anything Model 2 (SAM2); however, their application in surgical video analysis remains challenging due to complex motion dynamics and the redundancy of memory that impedes effective learning. In this work, we propose TSMS‑SAM2, a novel framework that enhances promptable VOST in surgical videos by addressing challenges of rapid object motion and memory redundancy in SAM2. TSMS‑SAM2 introduces two key strategies: multi‑temporal‑scale video sampling augmentation to improve robustness against motion variability, and a memory splitting and pruning mechanism that organizes and filters past frame features for more efficient and accurate segmentation. Evaluated on EndoVis2017 and EndoVis2018 datasets, TSMS‑SAM2 achieved the highest mean Dice scores of 95.24 and 86.73, respectively, outperforming prior SAM‑based and task‑specific methods. Extensive ablation studies confirm the effectiveness of multiscale temporal augmentation and memory splitting, highlighting the framework's potential for robust, efficient segmentation in complex surgical scenarios. Our source code will be available at https://github.com/apple1986/TSMS‑SAM2.
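
For readers unfamiliar with the metric, the Dice score reported above measures the overlap between a predicted mask and the ground truth. The sketch below is a generic single‑mask computation on toy data, not the evaluation protocol of the EndoVis benchmarks; the function name and the convention for two empty masks are assumptions. The abstract's figures (95.24 and 86.73) are presumably this quantity expressed as a percentage and averaged over the test data.

```python
def dice_score(pred, target):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two flattened binary masks."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    size_sum = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if size_sum == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * inter / size_sum


# Toy 6-pixel masks: 3 predicted pixels, 3 true pixels, 2 in common.
pred = [1, 1, 1, 0, 0, 0]
target = [0, 1, 1, 1, 0, 0]
print(dice_score(pred, target))
```

With 2 overlapping pixels out of 3 + 3, the score is 4/6, or about 0.667.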

Abstract:
Video object segmentation (VOS) aims to segment specified target objects throughout a video. Although state‑of‑the‑art methods have achieved impressive performance (e.g., 90+% J&F) on benchmarks such as DAVIS and YouTube‑VOS, these datasets primarily contain salient, dominant, and isolated objects, limiting their generalization to real‑world scenarios. To bridge this gap, the coMplex video Object SEgmentation (MOSEv1) dataset was introduced to facilitate VOS research in complex scenes. Building on the foundations and insights of MOSEv1, we present MOSEv2, a significantly more challenging dataset designed to further advance VOS methods under real‑world conditions. MOSEv2 consists of 5,024 videos and 701,976 high‑quality masks for 10,074 objects across 200 categories. Compared to its predecessor, MOSEv2 introduces much greater scene complexity, including more frequent object disappearance and reappearance, severe occlusions and crowding, smaller objects, as well as a range of new challenges such as adverse weather (e.g., rain, snow, fog), low‑light scenes (e.g., nighttime, underwater), multi‑shot sequences, camouflaged objects, non‑physical targets (e.g., shadows, reflections), and scenarios requiring external knowledge. We benchmark 20 representative VOS methods under 5 different settings and observe consistent performance drops on MOSEv2. For example, SAM2 drops from 76.4% on MOSEv1 to only 50.9% on MOSEv2. We further evaluate 9 video object tracking methods and observe similar declines, demonstrating that MOSEv2 poses challenges across tasks. These results highlight that despite strong performance on existing datasets, current VOS methods still fall short under real‑world complexities. Based on our analysis of the observed challenges, we further propose several practical tricks that enhance model performance. MOSEv2 is publicly available at https://MOSE.video.

Abstract:
The event camera, a novel neuromorphic vision sensor, records data with high temporal resolution and wide dynamic range, offering new possibilities for accurate visual representation in challenging scenarios. However, event data is inherently sparse and noisy, mainly reflecting brightness changes, which complicates effective feature extraction. To address this, we propose a self‑supervised pre‑training framework to fully reveal latent information in event data, including edge information and texture cues. Our framework consists of three stages. Difference‑guided Masked Modeling, inspired by the event physical sampling process, reconstructs temporal intensity difference maps to extract enhanced information from raw event data. Backbone‑fixed Feature Transition contrasts event and image features without updating the backbone, preserving the representations learned from masked modeling and stabilizing their effect on contrastive learning. Focus‑aimed Contrastive Learning updates the entire model to improve semantic discrimination by focusing on high‑value regions. Extensive experiments show our framework is robust and consistently outperforms state‑of‑the‑art methods on various downstream tasks, including object recognition, semantic segmentation, and optical flow estimation. The code and dataset are available at https://github.com/BIT‑Vision/EventPretrain.

Abstract:
Continual Semantic Segmentation (CSS) requires learning new classes without forgetting previously acquired knowledge, addressing the fundamental challenge of catastrophic forgetting in dense prediction tasks. However, existing CSS methods typically employ single‑stage encoder‑decoder architectures where segmentation masks and class labels are tightly coupled, leading to interference between old and new class learning and suboptimal retention‑plasticity balance. We introduce DecoupleCSS, a novel two‑stage framework for CSS. By decoupling class‑aware detection from class‑agnostic segmentation, DecoupleCSS enables more effective continual learning, preserving past knowledge while learning new classes. The first stage leverages pre‑trained text and image encoders, adapted using LoRA, to encode class‑specific information and generate location‑aware prompts. In the second stage, the Segment Anything Model (SAM) is employed to produce precise segmentation masks, ensuring that segmentation knowledge is shared across both new and previous classes. This approach improves the balance between retention and adaptability in CSS, achieving state‑of‑the‑art performance across a variety of challenging tasks. Our code is publicly available at: https://github.com/euyis1019/Decoupling‑Continual‑Semantic‑Segmentation.

Abstract:
Robust principal component analysis (RPCA) decomposes an observation matrix into low‑rank background and sparse object components. This capability has enabled its application in tasks ranging from image restoration to segmentation. However, traditional RPCA models suffer from computational burdens caused by matrix operations, reliance on finely tuned hyperparameters, and rigid priors that limit adaptability in dynamic scenarios. To solve these limitations, we propose RPCANet++, a sparse object segmentation framework that fuses the interpretability of RPCA with efficient deep architectures. Our approach unfolds a relaxed RPCA model into a structured network comprising a Background Approximation Module (BAM), an Object Extraction Module (OEM), and an Image Restoration Module (IRM). To mitigate inter‑stage transmission loss in the BAM, we introduce a Memory‑Augmented Module (MAM) to enhance background feature preservation, while a Deep Contrast Prior Module (DCPM) leverages saliency cues to expedite object extraction. Extensive experiments on diverse datasets demonstrate that RPCANet++ achieves state‑of‑the‑art performance under various imaging scenarios. We further improve interpretability via visual and numerical low‑rankness and sparsity measurements. By combining the theoretical strengths of RPCA with the efficiency of deep networks, our approach sets a new baseline for reliable and interpretable sparse object segmentation. Codes are available at our Project Webpage https://fengyiwu98.github.io/rpcanetx.
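
For reference, the decomposition described in the first sentence is classically posed as the convex program known as principal component pursuit. The formulation below is that textbook version, stated as background only: the abstract does not give the relaxed model that RPCANet++ unfolds, and the symbols D (observation), B (low‑rank background), O (sparse object), and λ (trade‑off weight) are notation assumed here.

```latex
\min_{B,\,O} \; \|B\|_{*} + \lambda \,\|O\|_{1}
\quad \text{subject to} \quad D = B + O
```

Here \|B\|_{*} is the nuclear norm (sum of singular values), which promotes a low‑rank background, and \|O\|_{1} is the entrywise ℓ1 norm, which promotes a sparse object component.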

Abstract:
In recent years, transformer‑based methods have achieved remarkable progress in medical image segmentation due to their superior ability to capture long‑range dependencies. However, these methods typically suffer from two major limitations. First, their computational complexity scales quadratically with the input sequences. Second, the feed‑forward network (FFN) modules in vanilla Transformers typically rely on fully connected layers, which limits models' ability to capture local contextual information and multiscale features critical for precise semantic segmentation. To address these issues, we propose an efficient medical image segmentation network, named TCSAFormer. The proposed TCSAFormer adopts two key ideas. First, it incorporates a Compressed Attention (CA) module, which combines token compression and pixel‑level sparse attention to dynamically focus on the most relevant key‑value pairs for each query. This is achieved by pruning globally irrelevant tokens and merging redundant ones, significantly reducing computational complexity while enhancing the model's ability to capture relationships between tokens. Second, it introduces a Dual‑Branch Feed‑Forward Network (DBFFN) module as a replacement for the standard FFN to capture local contextual features and multiscale information, thereby strengthening the model's feature representation capability. We conduct extensive experiments on three publicly available medical image segmentation datasets: ISIC‑2018, CVC‑ClinicDB, and Synapse, to evaluate the segmentation performance of TCSAFormer. Experimental results demonstrate that TCSAFormer achieves superior performance compared to existing state‑of‑the‑art (SOTA) methods, while maintaining lower computational overhead, thus achieving an optimal trade‑off between efficiency and accuracy.

Abstract:
In the semantic segmentation of remote sensing images, acquiring complete ground objects is critical for achieving precise analysis. However, this task is severely hindered by two major challenges: high intra‑class variance and high inter‑class similarity. Traditional methods often yield incomplete segmentation results due to their inability to effectively unify class representations and distinguish between similar features. Even emerging class‑guided approaches are limited by coarse class prototype representations and a neglect of target structural information. Therefore, this paper proposes a Prototype‑Driven Structure Synergy Network (PDSSNet). The design of this network is based on a core concept: a complete ground object is jointly defined by its invariant class semantics and its variant spatial structure. To implement this, we have designed three key modules. First, the Adaptive Prototype Extraction Module (APEM) ensures semantic accuracy from the source by encoding the ground truth to extract unbiased class prototypes. Subsequently, the designed Semantic‑Structure Coordination Module (SSCM) follows a hierarchical semantics‑first, structure‑second principle. This involves first establishing a global semantic cognition, then leveraging structural information to constrain and refine the semantic representation, thereby ensuring the integrity of class information. Finally, the Channel Similarity Adjustment Module (CSAM) employs a dynamic step‑size adjustment mechanism to focus on discriminative features between classes. Extensive experiments demonstrate that PDSSNet outperforms state‑of‑the‑art methods. The source code is available at https://github.com/wangjunyi‑1/PDSSNet.

Abstract:
Reconstructing and semantically interpreting 3D scenes from sparse 2D views remains a fundamental challenge in computer vision. Conventional methods often decouple semantic understanding from reconstruction or necessitate costly per‑scene optimization, thereby restricting their scalability and generalizability. In this paper, we introduce Uni3R, a novel feed‑forward framework that jointly reconstructs a unified 3D scene representation enriched with open‑vocabulary semantics, directly from unposed multi‑view images. Our approach leverages a Cross‑View Transformer to robustly integrate information across arbitrary multi‑view inputs, which then regresses a set of 3D Gaussian primitives endowed with semantic feature fields. This unified representation facilitates high‑fidelity novel view synthesis, open‑vocabulary 3D semantic segmentation, and depth prediction, all within a single, feed‑forward pass. Extensive experiments demonstrate that Uni3R establishes a new state‑of‑the‑art across multiple benchmarks, including 25.07 PSNR on RE10K and 55.84 mIoU on ScanNet. Our work signifies a novel paradigm towards generalizable, unified 3D scene reconstruction and understanding. The code is available at https://github.com/HorizonRobotics/Uni3R.

Abstract:
Deep learning‑based semantic segmentation models achieve impressive results yet remain limited in handling distribution shifts between training and test data. In this paper, we present SDGPA (Synthetic Data Generation and Progressive Adaptation), a novel method that tackles zero‑shot domain adaptive semantic segmentation, in which no target images are available, but only a text description of the target domain's style is provided. To compensate for the lack of target domain training data, we utilize a pretrained off‑the‑shelf text‑to‑image diffusion model, which generates training images by transferring source domain images to target style. Directly editing source domain images introduces noise that harms segmentation because the layout of source images cannot be precisely maintained. To address inaccurate layouts in synthetic data, we propose a method that crops the source image, edits small patches individually, and then merges them back together, which helps improve spatial precision. Recognizing the large domain gap, SDGPA constructs an augmented intermediate domain, leveraging easier adaptation subtasks to enable more stable model adaptation to the target domain. Additionally, to mitigate the impact of noise in synthetic data, we design a progressive adaptation strategy, ensuring robust learning throughout the training process. Extensive experiments demonstrate that our method achieves state‑of‑the‑art performance in zero‑shot semantic segmentation. The code is available at https://github.com/ROUJINN/SDGPA

Abstract:
We propose Cut‑Once‑and‑LEaRn (COLER), a simple approach for unsupervised instance segmentation and object detection. COLER first uses our developed CutOnce to generate coarse pseudo labels, then enables the detector to learn from these masks. CutOnce applies Normalized Cut (NCut) only once and does not rely on any clustering methods (e.g., K‑Means), but it can generate multiple object masks in an image. Our work opens a new direction for the NCut algorithm in multi‑object segmentation. We have designed several novel yet simple modules that not only allow CutOnce to fully leverage the object discovery capabilities of a self‑supervised model, but also free it from reliance on mask post‑processing. During training, COLER achieves strong performance without requiring specially designed loss functions for pseudo labels, and its performance is further improved through self‑training. COLER is a zero‑shot unsupervised model that outperforms previous state‑of‑the‑art methods on multiple benchmarks. We believe our method can help advance the field of unsupervised object localization. Code is available at: https://github.com/Quantumcraft616/COLER.

Abstract:
Semi‑supervised semantic segmentation, which leverages a limited set of labeled images, helps to relieve the heavy annotation burden. While pseudo‑labeling strategies yield promising results, there is still room for enhancing the reliability of pseudo‑labels. Hence, we develop a semi‑supervised framework, namely DerProp, equipped with a novel derivative label propagation to rectify imperfect pseudo‑labels. Our label propagation method imposes discrete derivative operations on pixel‑wise feature vectors as additional regularization, thereby generating strictly regularized similarity metrics. Doing so effectively alleviates the ill‑posed problem that identical similarities correspond to different features, through constraining the solution space. Extensive experiments are conducted to verify the rationality of our design, and demonstrate our superiority over other methods. Codes are available at https://github.com/ForawardStar/DerProp/.

Abstract:
The significance of informative and robust point representations has been widely acknowledged for 3D scene understanding. Although existing self‑supervised pre‑training counterparts demonstrate promising performance, model collapse and structural information deficiency remain prevalent due to insufficient point discrimination difficulty, yielding unreliable expressions and suboptimal performance. In this paper, we present GaussianCross, a novel cross‑modal self‑supervised 3D representation learning architecture integrating feed‑forward 3D Gaussian Splatting (3DGS) techniques to address current challenges. GaussianCross seamlessly converts scale‑inconsistent 3D point clouds into a unified cuboid‑normalized Gaussian representation without missing details, enabling stable and generalizable pre‑training. Subsequently, a tri‑attribute adaptive distillation splatting module is incorporated to construct a 3D feature field, facilitating synergetic feature capturing of appearance, geometry, and semantic cues to maintain cross‑modal consistency. To validate GaussianCross, we perform extensive evaluations on various benchmarks, including ScanNet, ScanNet200, and S3DIS. In particular, GaussianCross shows prominent parameter and data efficiency, achieving superior performance through linear probing (<0.1% parameters) and limited data training (1% of scenes) compared to state‑of‑the‑art methods. Furthermore, GaussianCross demonstrates strong generalization capabilities, improving the full fine‑tuning accuracy by 9.3% mIoU and 6.1% AP_50 on ScanNet200 semantic and instance segmentation tasks, respectively, supporting the effectiveness of our approach. The code, weights, and visualizations are publicly available at https://rayyoh.github.io/GaussianCross/.

Abstract:
Instance segmentation is critical in biomedical imaging to accurately distinguish individual objects like cells, which often overlap and vary in size. Recent query‑based methods, where object queries guide segmentation, have shown strong performance. While U‑Net has been a go‑to architecture in medical image segmentation, its potential in query‑based approaches remains largely unexplored. In this work, we present IAUNet, a novel query‑based U‑Net architecture. The core design features a full U‑Net architecture, enhanced by a novel lightweight convolutional Pixel decoder, making the model more efficient and reducing the number of parameters. Additionally, we propose a Transformer decoder that refines object‑specific features across multiple scales. Finally, we introduce the 2025 Revvity Full Cell Segmentation Dataset, a unique resource with detailed annotations of overlapping cell cytoplasm in brightfield images, setting a new benchmark for biomedical instance segmentation. Experiments on multiple public datasets and our own show that IAUNet outperforms most state‑of‑the‑art fully convolutional, transformer‑based, and query‑based models and cell segmentation‑specific models, setting a strong baseline for cell instance segmentation tasks. Code is available at https://github.com/SlavkoPrytula/IAUNet

Abstract:
Recent advances in Remote Sensing Foundation Models (RSFMs) have led to significant breakthroughs in the field. While many RSFMs have been pretrained with massive optical imagery, much multispectral/hyperspectral data still lacks corresponding foundation models. To leverage the advantages of spectral imagery in earth observation, we explore whether existing RSFMs can be effectively adapted to process diverse spectral modalities without requiring extensive spectral pretraining. In response to this challenge, we propose SpectralX, an innovative parameter‑efficient fine‑tuning framework that adapts existing RSFMs as backbones while introducing a two‑stage training approach to handle various spectral inputs, thereby significantly improving domain generalization performance. In the first stage, we employ a masked‑reconstruction task and design a specialized Hyper Tokenizer (HyperT) to extract attribute tokens from both spatial and spectral dimensions. Simultaneously, we develop an Attribute‑oriented Mixture of Adapter (AoMoA) that dynamically aggregates multi‑attribute expert knowledge while performing layer‑wise fine‑tuning. With semantic segmentation as the downstream task in the second stage, we insert an Attribute‑refined Adapter (Are‑adapter) into the first‑stage framework. By iteratively querying low‑level semantic features with high‑level representations, the model learns to focus on task‑beneficial attributes, enabling customized adjustment of RSFMs. Following this two‑phase adaptation process, SpectralX is capable of interpreting spectral imagery from new regions or seasons. The code will be available at https://github.com/YuxiangZhang‑BIT.

Abstract:
Vision Foundation Models (VFMs) have achieved remarkable success in various computer vision tasks. However, their application to semantic segmentation is hindered by two significant challenges: (1) the disparity in data scale, as segmentation datasets are typically much smaller than those used for VFM pre‑training, and (2) domain distribution shifts, where real‑world segmentation scenarios are diverse and often underrepresented during pre‑training. To overcome these limitations, we present Rein++, an efficient VFM‑based segmentation framework that demonstrates superior generalization from limited data and enables effective adaptation to diverse unlabeled scenarios. Specifically, Rein++ comprises a domain generalization solution Rein‑G and a domain adaptation solution Rein‑A. Rein‑G introduces a set of trainable, instance‑aware tokens that effectively refine the VFM's features for the segmentation task. This parameter‑efficient approach fine‑tunes less than 1% of the backbone's parameters, enabling robust generalization. Building on Rein‑G, Rein‑A performs unsupervised domain adaptation at both the instance and logit levels to mitigate domain shifts. In addition, it incorporates a semantic transfer module that leverages the class‑agnostic capabilities of the segment anything model to enhance boundary details in the target domain. The integrated Rein++ pipeline first learns a generalizable model on a source domain (e.g., daytime scenes) and subsequently adapts it to diverse target domains (e.g., nighttime scenes) without any target labels. Comprehensive experiments demonstrate that Rein++ significantly outperforms state‑of‑the‑art methods with efficient training, underscoring its role as an efficient, generalizable, and adaptive segmentation solution for VFMs, even for large models with billions of parameters. The code is available at https://github.com/wloves/Rein.

Abstract:
Face parsing aims to segment facial images into key components such as eyes, lips, and eyebrows. While existing methods rely on dense pixel‑level annotations, such annotations are expensive and labor‑intensive to obtain. To reduce annotation cost, we introduce Weakly Supervised Face Parsing (WSFP), a new task setting that performs dense facial component segmentation using only weak supervision, such as image‑level labels and natural language descriptions. WSFP introduces unique challenges due to the high co‑occurrence and visual similarity of facial components, which lead to ambiguous activations and degraded parsing performance. To address this, we propose DisFaceRep, a representation disentanglement framework designed to separate co‑occurring facial components through both explicit and implicit mechanisms. Specifically, we introduce a co‑occurring component disentanglement strategy to explicitly reduce dataset‑level bias, and a text‑guided component disentanglement loss to implicitly guide component separation using language supervision. Extensive experiments on CelebAMask‑HQ, LaPa, and Helen demonstrate the difficulty of WSFP and the effectiveness of DisFaceRep, which significantly outperforms existing weakly supervised semantic segmentation methods. The code will be released at https://github.com/CVI‑SZU/DisFaceRep.

Abstract:
Long‑term temporal information is crucial for event‑based perception tasks, as raw events only encode pixel brightness changes. Recent works show that when trained from scratch, recurrent models achieve better results than feedforward models in these tasks. However, when leveraging self‑supervised pre‑trained weights, feedforward models can outperform their recurrent counterparts. Current self‑supervised learning (SSL) methods for event‑based pre‑training largely mimic RGB image‑based approaches. They pre‑train feedforward models on raw events within a short time interval, ignoring the temporal information of events. In this work, we introduce TESPEC, a self‑supervised pre‑training framework tailored for learning spatio‑temporal information. TESPEC is well‑suited for recurrent models, as it is the first framework to leverage long event sequences during pre‑training. TESPEC employs the masked image modeling paradigm with a new reconstruction target. We design a novel method to accumulate events into pseudo grayscale videos containing high‑level semantic information about the underlying scene, which is robust to sensor noise and reduces motion blur. Reconstructing this target thus requires the model to reason about long‑term history of events. Extensive experiments demonstrate our state‑of‑the‑art results in downstream tasks, including object detection, semantic segmentation, and monocular depth estimation. Project webpage: https://mhdmohammadi.github.io/TESPEC_webpage.

Abstract:
Semantic segmentation models trained on known object classes often fail in real‑world autonomous driving scenarios by confidently misclassifying unknown objects. While pixel‑wise out‑of‑distribution detection can identify unknown objects, existing methods struggle in complex scenes where rare object classes are often confused with truly unknown objects. We introduce an uncertainty‑aware likelihood ratio estimation method that addresses these limitations. Our approach uses an evidential classifier within a likelihood ratio test to distinguish between known and unknown pixel features from a semantic segmentation model, while explicitly accounting for uncertainty. Instead of producing point estimates, our method outputs probability distributions that capture uncertainty from both rare training examples and imperfect synthetic outliers. We show that by incorporating uncertainty in this way, outlier exposure can be leveraged more effectively. Evaluated on five standard benchmark datasets, our method achieves the lowest average false positive rate (2.5%) among state‑of‑the‑art while maintaining high average precision (90.91%) and incurring only negligible computational overhead. Code is available at https://github.com/glasbruch/ULRE.

Abstract:
Underwater Instance Segmentation (UIS) tasks are crucial for underwater complex scene detection. Mamba, as an emerging state space model with inherently linear complexity and global receptive fields, is highly suitable for processing image segmentation tasks with long sequence features. However, due to the particularity of underwater scenes, there are many challenges in applying Mamba to UIS. The existing fixed‑patch scanning mechanism cannot maintain the internal continuity of scanned instances in the presence of severe underwater color distortion and blurred instance boundaries, and the hidden state of the complex underwater background can also inhibit the understanding of instance objects. In this work, we propose the first Mamba‑based underwater instance segmentation model, UIS‑Mamba, and design two innovative modules, Dynamic Tree Scan (DTS) and Hidden State Weaken (HSW), to migrate Mamba to the underwater task. The DTS module maintains the continuity of the internal features of the instance objects by allowing the patches to dynamically offset and scale, thereby guiding the minimum spanning tree and providing dynamic local receptive fields. The HSW module suppresses the interference of complex backgrounds and effectively focuses the information flow of state propagation on the instances themselves through the Ncut‑based hidden state weakening mechanism. Experimental results show that UIS‑Mamba achieves state‑of‑the‑art performance on both UIIS and USIS10K datasets, while maintaining a low number of parameters and computational complexity. Code is available at https://github.com/Maricalce/UIS‑Mamba.

Abstract:
This study introduces a novel dataset for segmenting flooded areas in satellite images. After reviewing 77 existing benchmarks utilizing satellite imagery, we identified a shortage of suitable datasets for this specific task. To fill this gap, we collected satellite imagery of the 2019 Midwestern USA floods from Planet Explorer by Planet Labs (Image © 2024 Planet Labs PBC). The dataset consists of 10 satellite images per location, each containing both flooded and non‑flooded areas. We selected ten locations from each of the five states: Iowa, Kansas, Montana, Nebraska, and South Dakota. The dataset ensures uniform resolution and resizing during data processing. For evaluating semantic segmentation performance, we tested state‑of‑the‑art models in computer vision and remote sensing on our dataset. Additionally, we conducted an ablation study varying window sizes to capture temporal characteristics. Overall, the models demonstrated modest results, suggesting a requirement for future multimodal and temporal learning strategies. The dataset will be publicly available at https://github.com/youngsunjang/SDSU_MidWest_Flood_2019.

Abstract:
Unsupervised Video Object Segmentation (UVOS) aims to predict pixel‑level masks for the most salient objects in videos without any prior annotations. While memory mechanisms have been proven critical in various video segmentation paradigms, their application in UVOS yields only marginal performance gains despite sophisticated design. Our analysis reveals a simple but fundamental flaw in existing methods: over‑reliance on memorizing high‑level semantic features. UVOS inherently suffers from a lack of fine‑grained information due to the absence of pixel‑level prior knowledge. Consequently, memory design relying solely on high‑level features, which predominantly capture abstract semantic cues, is insufficient to generate precise predictions. To resolve this fundamental issue, we propose a novel hierarchical memory architecture to incorporate both shallow‑ and high‑level features for memory, which leverages the complementary benefits of pixel and semantic information. Furthermore, to balance the simultaneous utilization of the pixel and semantic memory features, we propose a heterogeneous interaction mechanism to perform pixel‑semantic mutual interactions, which explicitly considers their inherent feature discrepancies. Through the design of the Pixel‑guided Local Alignment Module (PLAM) and the Semantic‑guided Global Integration Module (SGIM), we achieve delicate integration of the fine‑grained details in shallow‑level memory and the semantic representations in high‑level memory. Our Hierarchical Memory with Heterogeneous Interaction Network (HMHI‑Net) consistently achieves state‑of‑the‑art performance across all UVOS and video saliency detection benchmarks. Moreover, HMHI‑Net consistently exhibits high performance across different backbones, further demonstrating its superiority and robustness. Project page: https://github.com/ZhengxyFlow/HMHI‑Net.

Abstract:
Multi‑modal perception is essential for unmanned aerial vehicle (UAV) operations, as it enables a comprehensive understanding of the UAVs' surrounding environment. However, most existing multi‑modal UAV datasets are primarily biased toward localization and 3D reconstruction tasks, or only support map‑level semantic segmentation due to the lack of frame‑wise annotations for both camera images and LiDAR point clouds. This limitation prevents them from being used for high‑level scene understanding tasks. To address this gap and advance multi‑modal UAV perception, we introduce UAVScenes, a large‑scale dataset designed to benchmark various tasks across both 2D and 3D modalities. Our benchmark dataset is built upon the well‑calibrated multi‑modal UAV dataset MARS‑LVIG, originally developed only for simultaneous localization and mapping (SLAM). We enhance this dataset by providing manually labeled semantic annotations for both frame‑wise images and LiDAR point clouds, along with accurate 6‑degree‑of‑freedom (6‑DoF) poses. These additions enable a wide range of UAV perception tasks, including segmentation, depth estimation, 6‑DoF localization, place recognition, and novel view synthesis (NVS). Our dataset is available at https://github.com/sijieaaa/UAVScenes

Abstract:
Unsupervised object discovery (UOD) aims to detect and segment objects in 2D images without handcrafted annotations. Recent progress in self‑supervised representation learning has led to some success in UOD algorithms. However, the absence of ground truth poses two challenges for existing UOD methods: 1) determining if a discovered region is foreground or background, and 2) knowing how many objects remain undiscovered. To address these two problems, previous solutions rely on foreground priors to distinguish if the discovered region is foreground, and conduct one or a fixed number of discovery iterations. However, the existing foreground priors are heuristic and not always robust, and a fixed number of discoveries leads to under‑ or over‑segmentation, since the number of objects in images varies. This paper introduces UnionCut, a robust and well‑grounded foreground prior based on min‑cut and ensemble methods that detects the union of foreground areas of an image, allowing UOD algorithms to identify foreground objects and stop discovery once the majority of the foreground union in the image is segmented. In addition, we propose UnionSeg, a distilled transformer of UnionCut that outputs the foreground union more efficiently and accurately. Our experiments show that by combining with UnionCut or UnionSeg, previous state‑of‑the‑art UOD methods witness an increase in the performance of single object discovery, saliency detection and self‑supervised instance segmentation on various benchmarks. The code is available at https://github.com/YFaris/UnionCut.
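
The stopping idea described above (keep discovering until most of the foreground union is covered) can be illustrated with a toy loop. This is only a sketch under stated assumptions and not the UnionCut or UnionSeg implementation: masks are represented as Python sets of pixel indices, the name `discover_until_covered`, the `coverage` threshold of 0.5, and the rule that rejects masks lying mostly outside the union are all hypothetical choices made for illustration.

```python
def discover_until_covered(candidate_masks, foreground_union, coverage=0.5):
    """Accept candidate masks in order until they cover more than `coverage` of the union."""
    covered = set()
    accepted = []
    if not foreground_union:
        return accepted  # nothing to discover without a foreground estimate
    for mask in candidate_masks:
        # Stop once the majority (by the chosen threshold) of the union is segmented.
        if len(covered) / len(foreground_union) > coverage:
            break
        overlap = mask & foreground_union
        # Hypothetical rule: a mask lying mostly outside the union is treated as background.
        if len(overlap) * 2 <= len(mask):
            continue
        accepted.append(mask)
        covered |= overlap
    return accepted


# Toy data (hypothetical): pixels 0..9 form the foreground union.
union = set(range(10))
candidates = [{0, 1, 2, 3}, {20, 21, 22}, {4, 5, 6}, {7, 8}]
result = discover_until_covered(candidates, union)
print(result)  # the background mask is skipped and the last mask is never visited
```

In this toy run the second candidate is rejected as background, and discovery stops before the fourth candidate because 7 of the 10 union pixels are already covered.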

Abstract:
Wildlife observation plays an important role in biodiversity conservation, necessitating robust methodologies for monitoring wildlife populations and interspecies interactions. Recent advances in computer vision have significantly contributed to automating fundamental wildlife observation tasks, such as animal detection and species identification. However, accurately identifying species from indirect evidence like footprints and feces remains relatively underexplored, despite its importance in contributing to wildlife monitoring. To bridge this gap, we introduce AnimalClue, the first large‑scale dataset for species identification from images of indirect evidence. Our dataset consists of 159,605 bounding boxes encompassing five categories of indirect clues: footprints, feces, eggs, bones, and feathers. It covers 968 species, 200 families, and 65 orders. Each image is annotated with species‑level labels, bounding boxes or segmentation masks, and fine‑grained trait information, including activity patterns and habitat preferences. Unlike existing datasets primarily focused on direct visual features (e.g., animal appearances), AnimalClue presents unique challenges for classification, detection, and instance segmentation tasks due to the need for recognizing more detailed and subtle visual features. In our experiments, we extensively evaluate representative vision models and identify key challenges in animal identification from their traces. Our dataset and code are available at https://dahlian00.github.io/AnimalCluePage/

Abstract:
Medical image segmentation plays an important role in computer‑aided diagnosis. Traditional convolution‑based U‑shape segmentation architectures are usually limited by the local receptive field. Existing vision transformers have been widely applied to diverse medical segmentation frameworks due to their superior capabilities of capturing global contexts. Despite the advantage, the real‑world application of vision transformers is challenged by their non‑linear self‑attention mechanism, requiring huge computational costs. To address this issue, the selective state space model (SSM) Mamba has gained recognition for its adeptness in modeling long‑range dependencies in sequential data, particularly noted for its efficient memory costs. In this paper, we propose MambaVesselNet++, a hybrid CNN‑Mamba framework for medical image segmentation. Our MambaVesselNet++ comprises a hybrid image encoder (Hi‑Encoder) and a bifocal fusion decoder (BF‑Decoder). In the Hi‑Encoder, we first devise the texture‑aware layer to capture low‑level semantic features by leveraging convolutions. Then, we utilize Mamba to effectively model long‑range dependencies with linear complexity. The BF‑Decoder adopts skip connections to combine local and global information of the Hi‑Encoder for the accurate generation of segmentation masks. Extensive experiments demonstrate that MambaVesselNet++ outperforms current convolution‑based, transformer‑based, and Mamba‑based state‑of‑the‑art methods across diverse medical 2D, 3D, and instance segmentation tasks. The code is available at https://github.com/CC0117/MambaVesselNet.

Abstract:
In this paper, we present Latest Object Memory Management (LOMM) for temporally consistent video instance segmentation that significantly improves long‑term instance tracking. At the core of our method is Latest Object Memory (LOM), which robustly tracks and continuously updates the latest states of objects by explicitly modeling their presence in each frame. This enables consistent tracking and accurate identity management across frames, enhancing both performance and reliability throughout the VIS process. Moreover, we introduce Decoupled Object Association (DOA), a strategy that separately handles newly appearing and already existing objects. By leveraging our memory system, DOA accurately assigns object indices, improving matching accuracy and ensuring stable identity consistency, even in dynamic scenes where objects frequently appear and disappear. Extensive experiments and ablation studies demonstrate the superiority of our method over traditional approaches, setting a new benchmark in VIS. Notably, our LOMM achieves a state‑of‑the‑art AP score of 54.0 on YouTube‑VIS 2022, a dataset known for its challenging long videos. Project page: https://seung‑hun‑lee.github.io/projects/LOMM/

Abstract:
Video Large Language Models (VideoLLMs) have recently demonstrated remarkable progress in general video understanding. However, existing models primarily focus on high‑level comprehension and are limited to text‑only responses, restricting the flexibility for object‑centric, multi‑round interactions. In this paper, we make three contributions: (i) we address these limitations by introducing a VideoLLM model, capable of performing both object referring for input and grounding for output in video reasoning tasks, i.e., allowing users to interact with videos using both textual and visual prompts; (ii) we propose STOM (Spatial‑Temporal Overlay Module), a novel approach that propagates arbitrary visual prompts input at any single timestamp to the remaining frames within a video; (iii) we present VideoInfer, a manually curated object‑centric video instruction dataset featuring question‑answering pairs that require reasoning. We conduct comprehensive experiments on VideoInfer and other existing benchmarks across video question answering and referring object segmentation. The results on 12 benchmarks of 6 tasks show that our proposed model consistently outperforms baselines in both video question answering and segmentation, underscoring its robustness in multimodal, object‑centric video and image understanding. Project page: https://qirui‑chen.github.io/RGA3‑release/.

Abstract:
Recent advances in point cloud deep learning have led to models that achieve high per‑part labeling accuracy on large‑scale point clouds, using only the raw geometry of unordered point sets. In parallel, the field of human parsing focuses on predicting body part and clothing/accessory labels from images. This work aims to bridge these two domains by enabling per‑vertex semantic segmentation of large‑scale human meshes. To achieve this, a pseudo‑ground truth labeling pipeline is developed for the Thuman2.1 dataset: meshes are first aligned to a canonical pose, segmented from multiple viewpoints, and the resulting point‑level labels are then backprojected onto the original mesh to produce per‑point pseudo ground truth annotations. Subsequently, a novel, memory‑efficient sampling strategy is introduced: a windowed iterative farthest point sampling (FPS) with space‑filling curve‑based serialization, which effectively downsamples the point clouds. This is followed by a purely geometric segmentation using PointTransformer, enabling semantic parsing of human meshes without relying on texture information. Experimental results confirm the effectiveness and accuracy of the proposed approach. Project code and pre‑processed data are available at https://github.com/JamesMcCullochDickens/Human3DParsing/tree/master.
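
As a rough illustration of the farthest point sampling step mentioned above, the following is a minimal sketch of plain iterative FPS in pure Python. It is not the authors' implementation: the windowing and space‑filling curve serialization described in the abstract are omitted, and the function name `farthest_point_sampling` and its `start` parameter are assumptions made for this example.

```python
import math


def farthest_point_sampling(points, k, start=0):
    """Return indices of k points picked by plain iterative FPS.

    points: sequence of equal-length coordinate tuples.
    start:  index of the first selected point (an arbitrary choice here).
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    selected = [start]
    # min_dist[i] is the distance from point i to its nearest selected point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < k:
        # Pick the point that is farthest from everything selected so far.
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        # Refresh nearest-selected distances with the newly added point.
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected
```

Each iteration scans all points, so this sketch costs O(N·k) time, which is presumably the kind of cost a windowed variant is meant to reduce on large point clouds.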

Abstract:
Transformer‑based methods have demonstrated remarkable capabilities in 3D semantic segmentation through their powerful attention mechanisms, but the quadratic complexity limits their modeling of long‑range dependencies in large‑scale point clouds. While recent Mamba‑based approaches offer efficient processing with linear complexity, they struggle with feature representation when extracting 3D features. However, effectively combining these complementary strengths remains an open challenge in this field. In this paper, we propose HybridTM, the first hybrid architecture that integrates Transformer and Mamba for 3D semantic segmentation. In addition, we propose the Inner Layer Hybrid Strategy, which combines attention and Mamba at a finer granularity, enabling simultaneous capture of long‑range dependencies and fine‑grained local features. Extensive experiments demonstrate the effectiveness and generalization of our HybridTM on diverse indoor and outdoor datasets. Furthermore, our HybridTM achieves state‑of‑the‑art performance on ScanNet, ScanNet200, and nuScenes benchmarks. The code will be made available at https://github.com/deepinact/HybridTM.

Abstract:
We introduce Iwin Transformer, a novel position‑embedding‑free hierarchical vision transformer, which can be fine‑tuned directly from low to high resolution, through the collaboration of innovative interleaved window attention and depthwise separable convolution. This approach uses attention to connect distant tokens and applies convolution to link neighboring tokens, enabling global information exchange within a single module, overcoming Swin Transformer's limitation of requiring two consecutive blocks to approximate global attention. Extensive experiments on visual benchmarks demonstrate that Iwin Transformer exhibits strong competitiveness in tasks such as image classification (87.4 top‑1 accuracy on ImageNet‑1K), semantic segmentation and video action recognition. We also validate the effectiveness of the core component in Iwin as a standalone module that can seamlessly replace the self‑attention module in class‑conditional image generation. The concepts and methods introduced by the Iwin Transformer have the potential to inspire future research, like Iwin 3D Attention in video generation. The code and models are available at https://github.com/cominder/Iwin‑Transformer.

Abstract:
Electrocardiogram (ECG) delineation, the segmentation of meaningful waveform features, is critical for clinical diagnosis. Despite recent advances using deep learning, progress has been limited by the scarcity of publicly available annotated datasets. Semi‑supervised learning presents a promising solution by leveraging abundant unlabeled ECG data. In this study, we present SemiSegECG, the first systematic benchmark for semi‑supervised semantic segmentation (SemiSeg) in ECG delineation. We curated and unified multiple public datasets, including previously underused sources, to support robust and diverse evaluation. We adopted five representative SemiSeg algorithms from computer vision, implemented them on two different architectures: the convolutional network and the transformer, and evaluated them in two different settings: in‑domain and cross‑domain. Additionally, we propose ECG‑specific training configurations and augmentation strategies and introduce a standardized evaluation framework. Our results show that the transformer outperforms the convolutional network in semi‑supervised ECG delineation. We anticipate that SemiSegECG will serve as a foundation for advancing semi‑supervised ECG delineation methods and will facilitate further research in this domain.

Abstract:
In Unsupervised Domain Adaptive Semantic Segmentation (UDA‑SS), a model is trained on labeled source domain data (e.g., synthetic images) and adapted to an unlabeled target domain (e.g., real‑world images) without access to target annotations. Existing UDA‑SS methods often struggle to balance fine‑grained local details with global contextual information, leading to segmentation errors in complex regions. To address this, we introduce the Adaptive Feature Refinement (AFR) module, which enhances segmentation accuracy by refining high‑resolution features using semantic priors from low‑resolution logits. AFR also integrates high‑frequency components, which capture fine‑grained structures and provide crucial boundary information, improving object delineation. Additionally, AFR adaptively balances local and global information through uncertainty‑driven attention, reducing misclassifications. Its lightweight design allows seamless integration into HRDA‑based UDA methods, leading to state‑of‑the‑art segmentation performance. Our approach improves existing UDA‑SS methods by 1.05% mIoU on GTA V → Cityscapes and 1.04% mIoU on Synthia → Cityscapes. The implementation of our framework is available at: https://github.com/Masrur02/AFRDA

Abstract:
Spatial semantic segmentation of sound scenes (S5) involves the accurate identification of active sound classes and the precise separation of their sources from complex acoustic mixtures. Conventional systems rely on a two‑stage pipeline ‑ audio tagging followed by label‑conditioned source separation ‑ but are often constrained by the absence of fine‑grained temporal information critical for effective separation. In this work, we address this limitation by introducing a novel approach for S5 that enhances the synergy between the event detection and source separation stages. Our key contributions are threefold. First, we fine‑tune a pre‑trained Transformer to detect active sound classes. Second, we utilize a separate instance of this fine‑tuned Transformer to perform sound event detection (SED), providing the separation module with detailed, time‑varying guidance. Third, we implement an iterative refinement mechanism that progressively enhances separation quality by recursively reusing the separator's output from previous iterations. These advancements lead to significant improvements in both audio tagging and source separation performance, as demonstrated by our system's second‑place finish in Task 4 of the DCASE Challenge 2025. Our implementation and model checkpoints are available in our GitHub repository: https://github.com/theMoro/dcase25task4 .

Abstract:
The tumor immune microenvironment (TIME) in non‑small cell lung cancer (NSCLC) histopathology contains morphological and molecular characteristics predictive of immunotherapy response. Computational quantification of TIME characteristics, such as cell detection and tissue segmentation, can support biomarker development. However, currently available digital pathology datasets of NSCLC for the development of cell detection or tissue segmentation algorithms are limited in scope, lack annotations of clinically prevalent metastatic sites, and forgo molecular information such as PD‑L1 immunohistochemistry (IHC). To fill this gap, we introduce the IGNITE data toolkit, a multi‑stain, multi‑centric, and multi‑scanner dataset of annotated NSCLC whole‑slide images. We publicly release 887 fully annotated regions of interest from 155 unique patients across three complementary tasks: (i) multi‑class semantic segmentation of tissue compartments in H&E‑stained slides, with 16 classes spanning primary and metastatic NSCLC, (ii) nuclei detection, and (iii) PD‑L1 positive tumor cell detection in PD‑L1 IHC slides. To the best of our knowledge, this is the first public NSCLC dataset with manual annotations of H&E in metastatic sites and PD‑L1 IHC.

Abstract:
Transformers and Mamba, initially invented for natural language processing, have inspired backbone architectures for visual recognition. Recent studies integrated Local Attention Transformers with Mamba to capture both local details and global contexts. Despite competitive performance, these methods are limited to simple stacking of Transformer and Mamba layers without any interaction mechanism between them. Thus, deep integration between Transformer and Mamba layers remains an open problem. We address this problem by proposing A2Mamba, a powerful Transformer‑Mamba hybrid network architecture, featuring a new token mixer termed Multi‑scale Attention‑augmented State Space Model (MASS), where multi‑scale attention maps are integrated into an attention‑augmented SSM (A2SSM). A key step of A2SSM performs a variant of cross‑attention by spatially aggregating the SSM's hidden states using the multi‑scale attention maps, which enhances spatial dependencies pertaining to a two‑dimensional space while improving the dynamic modeling capabilities of SSMs. Our A2Mamba outperforms all previous ConvNet‑, Transformer‑, and Mamba‑based architectures in visual recognition tasks. For instance, A2Mamba‑L achieves an impressive 86.1% top‑1 accuracy on ImageNet‑1K. In semantic segmentation, A2Mamba‑B exceeds CAFormer‑S36 by 2.5% in mIoU, while exhibiting higher efficiency. In object detection and instance segmentation with Cascade Mask R‑CNN, A2Mamba‑S surpasses MambaVision‑B by 1.2%/0.9% in AP^b/AP^m, while having 40% fewer parameters. Code is publicly available at https://github.com/LMMMEng/A2Mamba.

Abstract:
Despite the remarkable developments achieved by recent 3D generation works, scaling these methods to geographic extents, such as modeling thousands of square kilometers of Earth's surface, remains an open challenge. We address this through a dual innovation in data infrastructure and model architecture. First, we introduce Aerial‑Earth3D, the largest 3D aerial dataset to date, consisting of 50k curated scenes (each measuring 600m x 600m) captured across the U.S. mainland, comprising 45M multi‑view Google Earth frames. Each scene provides pose‑annotated multi‑view images, depth maps, normals, semantic segmentation, and camera poses, with explicit quality control to ensure terrain diversity. Building on this foundation, we propose EarthCrafter, a tailored framework for large‑scale 3D Earth generation via sparse‑decoupled latent diffusion. Our architecture separates structural and textural generation: 1) Dual sparse 3D‑VAEs compress high‑resolution geometric voxels and textural 2D Gaussian Splats (2DGS) into compact latent spaces, largely alleviating the costly computation suffering from vast geographic scales while preserving critical information. 2) We propose condition‑aware flow matching models trained on mixed inputs (semantics, images, or neither) to flexibly model latent geometry and texture features independently. Extensive experiments demonstrate that EarthCrafter performs substantially better in extremely large‑scale generation. The framework further supports versatile applications, from semantic‑guided urban layout generation to unconditional terrain synthesis, while maintaining geographic plausibility through our rich data priors from Aerial‑Earth3D. Our project page is available at https://whiteinblue.github.io/earthcrafter/

Abstract:
We propose Segment Concept (SeC), a concept‑driven video object segmentation (VOS) framework that shifts from conventional feature matching to the progressive construction and utilization of high‑level, object‑centric representations. SeC employs Large Vision‑Language Models (LVLMs) to integrate visual cues across diverse frames, constructing robust conceptual priors. To balance semantic reasoning with computational overhead, SeC forwards the LVLMs only when a new scene appears, injecting concept‑level features at those points. To rigorously assess VOS methods in scenarios demanding high‑level conceptual reasoning and robust semantic understanding, we introduce the Semantic Complex Scenarios Video Object Segmentation benchmark (SeCVOS). SeCVOS comprises 160 manually annotated multi‑scenario videos designed to challenge models with substantial appearance variations and dynamic scene transformations. Empirical evaluations demonstrate that SeC substantially outperforms state‑of‑the‑art approaches, including SAM 2 and its advanced variants, on both SeCVOS and standard VOS benchmarks. In particular, SeC achieves an 11.8‑point improvement over SAM 2.1 on SeCVOS, establishing a new state‑of‑the‑art in concept‑aware VOS.

Abstract:
Facial expression recognition (FER) is a challenging task due to pervasive occlusion and dataset biases. Especially when facial information is partially occluded, existing FER models struggle to extract effective facial features, leading to inaccurate classifications. In response, we present ORSANet, which introduces the following three key contributions: First, we introduce auxiliary multi‑modal semantic guidance to disambiguate facial occlusion and learn high‑level semantic knowledge, which is two‑fold: 1) we introduce semantic segmentation maps as dense semantics prior to generate semantics‑enhanced facial representations; 2) we introduce facial landmarks as sparse geometric prior to mitigate intrinsic noises in FER, such as identity and gender biases. Second, to facilitate the effective incorporation of these two multi‑modal priors, we customize a Multi‑scale Cross‑interaction Module (MCM) to adaptively fuse the landmark feature and semantics‑enhanced representations within different scales. Third, we design a Dynamic Adversarial Repulsion Enhancement Loss (DARELoss) that dynamically adjusts the margins of ambiguous classes, further enhancing the model's ability to distinguish similar expressions. We further construct the first occlusion‑oriented FER dataset to facilitate specialized robustness analysis on various real‑world occlusion conditions, dubbed Occlu‑FER. Extensive experiments on both public benchmarks and Occlu‑FER demonstrate that our proposed ORSANet achieves SOTA recognition performance. Code is publicly available at https://github.com/Wenyuzhy/ORSANet‑master.

Abstract:
In recent years, 3D generation has made great strides in both academia and industry. However, generating 3D scenes from a single RGB image remains a significant challenge, as current approaches often struggle to ensure both object generation quality and scene coherence in multi‑object scenarios. To overcome these limitations, we propose a novel three‑stage framework for 3D scene generation with explicit geometric representations and high‑quality textural details via single image‑guided model generation and spatial layout optimization. Our method begins with an image instance segmentation and inpainting phase, which recovers missing details of occluded objects in the input images, thereby achieving complete generation of foreground 3D assets. Subsequently, our approach captures the spatial geometry of the reference image by constructing a pseudo‑stereo viewpoint for camera parameter estimation and scene depth inference, while employing a model selection strategy to ensure optimal alignment between the 3D assets generated in the previous step and the input. Finally, through model parameterization and minimization of the Chamfer distance between point clouds in 3D and 2D space, our approach optimizes layout parameters to produce an explicit 3D scene representation that maintains precise alignment with the input guidance image. Extensive experiments on multi‑object scene image sets demonstrate that our approach not only outperforms state‑of‑the‑art methods in terms of geometric accuracy and texture fidelity of individual generated 3D models, but also has significant advantages in scene layout synthesis.
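
For reference, the Chamfer distance that the layout optimization above minimizes can be sketched as follows. Several variants exist (squared or unsquared distances, sums or means); this example assumes the symmetric mean of squared nearest‑neighbor distances, which may differ from the variant the authors use. The function name `chamfer_distance` is an assumption made for this example.

```python
import math


def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two non-empty point sets.

    Uses the mean squared distance from each point to its nearest
    neighbor in the other set, summed over both directions.
    """
    if not a or not b:
        raise ValueError("both point sets must be non-empty")

    def one_way(src, dst):
        # Average, over src, of the squared distance to the closest dst point.
        return sum(min(math.dist(p, q) ** 2 for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)
```

The same function works for 2D and 3D points because `math.dist` accepts coordinate sequences of any matching length.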

Abstract:
Modern multispectral feature fusion for object detection faces two critical limitations: (1) Excessive preference for local complementary features over cross‑modal shared semantics adversely affects generalization performance; and (2) The trade‑off between the receptive field size and computational complexity presents a critical bottleneck for scalable feature modeling. Addressing these issues, a novel Multispectral State‑Space Feature Fusion framework, dubbed MS2Fusion, is proposed based on the state space model (SSM), achieving efficient and effective fusion through a dual‑path parametric interaction mechanism. More specifically, the first cross‑parameter interaction branch inherits the advantage of cross‑attention in mining complementary information with cross‑modal hidden state decoding in SSM. The second shared‑parameter branch explores cross‑modal alignment with joint embedding to obtain cross‑modal similar semantic features and structures through parameter sharing in SSM. Finally, these two paths are jointly optimized with SSM for fusing multispectral features in a unified framework, allowing our MS2Fusion to enjoy both functional complementarity and shared semantic space. In our extensive experiments on mainstream benchmarks including FLIR, M3FD and LLVIP, our MS2Fusion significantly outperforms other state‑of‑the‑art multispectral object detection methods, evidencing its superiority. Moreover, MS2Fusion is general and applicable to other multispectral perception tasks. We show that, even without specific design, MS2Fusion achieves state‑of‑the‑art results on RGB‑T semantic segmentation and RGB‑T salient object detection, showing its generality. The source code will be available at https://github.com/61s61min/MS2Fusion.git.

Abstract:
Recent advances in medical image segmentation have been driven by deep learning; however, most existing methods remain limited by modality‑specific designs and exhibit poor adaptability to dynamic medical imaging scenarios. The Segment Anything Model 2 (SAM2) and its related variants, which introduce a streaming memory mechanism for real‑time video segmentation, present new opportunities for prompt‑based, generalizable solutions. Nevertheless, adapting these models to medical video scenarios typically requires large‑scale datasets for retraining or transfer learning, leading to high computational costs and the risk of catastrophic forgetting. To address these challenges, we propose DD‑SAM2, an efficient adaptation framework for SAM2 that incorporates a Depthwise‑Dilated Adapter (DD‑Adapter) to enhance multi‑scale feature extraction with minimal parameter overhead. This design enables effective fine‑tuning of SAM2 on medical videos with limited training data. Unlike existing adapter‑based methods focused solely on static images, DD‑SAM2 fully exploits SAM2's streaming memory for medical video object tracking and segmentation. Comprehensive evaluations on TrackRad2025 (tumor segmentation) and EchoNet‑Dynamic (left ventricle tracking) datasets demonstrate superior performance, achieving Dice scores of 0.93 and 0.97, respectively. To the best of our knowledge, this work provides an initial attempt at systematically exploring adapter‑based SAM2 fine‑tuning for medical video segmentation and tracking. Code, datasets, and models will be publicly available at https://github.com/apple1986/DD‑SAM2.
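
The Dice scores reported above follow the standard overlap definition, Dice = 2|A ∩ B| / (|A| + |B|). A minimal sketch for flattened binary masks is shown below; the function name `dice_score` and the convention of returning 1.0 when both masks are empty are assumptions made for this example, and the paper's evaluation protocol may handle that case differently.

```python
def dice_score(pred, target):
    """Dice coefficient between two flattened binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    # Count pixels that are foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground pixels in the two masks combined.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total
```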

Abstract:
Most existing remote sensing instance segmentation approaches are designed for closed‑vocabulary prediction, limiting their ability to recognize novel categories or generalize across datasets. This restricts their applicability in diverse Earth observation scenarios. To address this, we introduce open‑vocabulary (OV) learning for remote sensing instance segmentation. While current OV segmentation models perform well on natural image datasets, their direct application to remote sensing faces challenges such as diverse landscapes, seasonal variations, and the presence of small or ambiguous objects in aerial imagery. To overcome these challenges, we propose SCORE (Scene Context matters in Open‑vocabulary REmote sensing instance segmentation), a framework that integrates multi‑granularity scene context, i.e., regional context and global context, to enhance both visual and textual representations. Specifically, we introduce Region‑Aware Integration, which refines class embeddings with regional context to improve object distinguishability. Additionally, we propose Global Context Adaptation, which enriches naive text embeddings with remote sensing global context, creating a more adaptable and expressive linguistic latent space for the classifier. We establish new benchmarks for OV remote sensing instance segmentation across diverse datasets. Experimental results demonstrate that our proposed method achieves SOTA performance, providing a robust solution for large‑scale, real‑world geospatial analysis. Our code is available at https://github.com/HuangShiqi128/SCORE.

Abstract:
Traditional volume visualization (VolVis) methods, like direct volume rendering, suffer from rigid transfer function designs and high computational costs. Although novel view synthesis approaches enhance rendering efficiency, they require additional learning effort for non‑experts and lack support for semantic‑level interaction. To bridge this gap, we propose NLI4VolVis, an interactive system that enables users to explore, query, and edit volumetric scenes using natural language. NLI4VolVis integrates multi‑view semantic segmentation and vision‑language models to extract and understand semantic components in a scene. We introduce a multi‑agent large language model architecture equipped with extensive function‑calling tools to interpret user intents and execute visualization tasks. The agents leverage external tools and declarative VolVis commands to interact with the VolVis engine powered by 3D editable Gaussians, enabling open‑vocabulary object querying, real‑time scene editing, best‑view selection, and 2D stylization. We validate our system through case studies and a user study, highlighting its improved accessibility and usability in volumetric data exploration. We strongly recommend readers check our case studies, demo video, and source code at https://nli4volvis.github.io/.

Abstract:
Biomedical segmentation networks easily suffer from unexpected misclassification between foreground and background objects when learning on limited and imperfect medical datasets. Inspired by the strong power of Out‑of‑Distribution (OoD) data on other visual tasks, we propose a data‑centric framework, Med‑OoD, to address this issue by introducing OoD data supervision into fully‑supervised biomedical segmentation with none of the following needs: (i) external data sources, (ii) feature regularization objectives, (iii) additional annotations. Our method can be seamlessly integrated into segmentation networks without any modification to the architectures. Extensive experiments show that Med‑OoD largely prevents various segmentation networks from pixel misclassification on medical images and achieves considerable performance improvements on the Lizard dataset. We also present an emerging learning paradigm of training a medical segmentation network completely using OoD data devoid of foreground class labels, which surprisingly yields 76.1% mIoU as the test result. We hope this learning paradigm will attract people to rethink the roles of OoD data. Code is made available at https://github.com/StudioYG/Med‑OoD.

Abstract:
Vision Transformers (ViTs) have significantly advanced computer vision, demonstrating strong performance across various tasks. However, the attention mechanism in ViTs makes each layer function as a low‑pass filter, and the stacked‑layer architecture in existing transformers suffers from frequency vanishing. This leads to the loss of critical details and textures. We propose a novel, circuit‑theory‑inspired strategy called Frequency‑Dynamic Attention Modulation (FDAM), which can be easily plugged into ViTs. FDAM directly modulates the overall frequency response of ViTs and consists of two techniques: Attention Inversion (AttInv) and Frequency Dynamic Scaling (FreqScale). Since circuit theory uses low‑pass filters as fundamental elements, we introduce AttInv, a method that generates complementary high‑pass filtering by inverting the low‑pass filter in the attention matrix, and dynamically combining the two. We further design FreqScale to weight different frequency components for fine‑grained adjustments to the target response function. Through feature similarity analysis and effective rank evaluation, we demonstrate that our approach avoids representation collapse, leading to consistent performance improvements across various models, including SegFormer, DeiT, and MaskDINO. These improvements are evident in tasks such as semantic segmentation, object detection, and instance segmentation. Additionally, we apply our method to remote sensing detection, achieving state‑of‑the‑art results in single‑scale settings. The code is available at https://github.com/Linwei‑Chen/FDAM.
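
The idea behind Attention Inversion can be illustrated with a toy example. The sketch below assumes that "inverting" a row‑stochastic attention matrix A means using (I − A) as the complementary high‑pass filter, and it combines the two branches with fixed scalar weights, whereas the abstract describes a dynamic combination. The function names and the `low_w`/`high_w` parameters are hypothetical and are not taken from the FDAM code.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector of scalars."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def attention_inversion(attn, x, low_w=1.0, high_w=1.0):
    """Mix low-pass (A x) and high-pass ((I - A) x) responses of one head.

    attn: row-stochastic attention matrix; x: one scalar feature per token.
    low_w, high_w: fixed mixing weights (assumed; the paper's are dynamic).
    """
    low = matvec(attn, x)                        # averaging acts as a low-pass filter
    high = [xi - li for xi, li in zip(x, low)]   # residual keeps what averaging removed
    return [low_w * l + high_w * h for l, h in zip(low, high)]
```

With `high_w=0` the output is the usual attention average, with both weights equal to 1 the input is recovered unchanged, and with `high_w > 1` the token‑to‑token differences are amplified.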

Abstract:
High spatial frequency information, including fine details like textures, significantly contributes to the accuracy of semantic segmentation. However, according to the Nyquist‑Shannon Sampling Theorem, high‑frequency components are vulnerable to aliasing or distortion when propagating through downsampling layers such as strided‑convolution. Here, we propose a novel Spatial Frequency Modulation (SFM) that modulates high‑frequency features to a lower frequency before downsampling and then demodulates them back during upsampling. Specifically, we implement modulation through adaptive resampling (ARS) and design a lightweight add‑on that can densely sample the high‑frequency areas to scale up the signal, thereby lowering its frequency in accordance with the Frequency Scaling Property. We also propose Multi‑Scale Adaptive Upsampling (MSAU) to demodulate the modulated feature and recover high‑frequency information through non‑uniform upsampling. This module further improves segmentation by explicitly exploiting information interaction between densely and sparsely resampled areas at multiple scales. Both modules can seamlessly integrate with various architectures, extending from convolutional neural networks to transformers. Feature visualization and analysis confirm that our method effectively alleviates aliasing while successfully retaining details after demodulation. Finally, we validate the broad applicability and effectiveness of SFM by extending it to image classification, adversarial robustness, instance segmentation, and panoptic segmentation tasks. The code is available at https://github.com/Linwei‑Chen/SFM.

Abstract:
ViTs deliver SOTA performance, yet their fixed computational budget prevents scalable deployment across heterogeneous hardware. Recent Matryoshka‑style Transformer architectures mitigate this by embedding nested subnetworks within a single model to enable scalable inference. However, these models allocate the same amount of compute to all inputs, regardless of their complexity, which leads to inefficiencies. To address this, we introduce ThinkingViT, a nested ViT architecture that employs progressive thinking stages to dynamically adjust inference computation based on input difficulty. ThinkingViT first activates a small subset of the most important attention heads to produce an initial prediction. If the prediction confidence exceeds a predefined threshold, inference terminates early. Otherwise, within the same backbone, it activates a larger subset of attention heads and conducts a new forward pass. This process continues iteratively until the model reaches the predefined confidence level or exhausts its maximum capacity. To boost the performance of subsequent rounds, we introduce a Token Recycling approach that fuses the input embeddings with the embeddings from the previous stage. Experiments show that ThinkingViT surpasses nested baselines by up to 2.0 percentage points (p.p.) in accuracy at the same throughput and by up to 2.9 p.p. at equal GMACs on ImageNet‑1K. We show that the backbone‑preserving design of ThinkingViT allows it to serve as a plug‑in upgrade for ViTs in downstream tasks such as semantic segmentation. We also demonstrate that ThinkingViT transfers effectively to other architectures such as Swin Transformers. The source code is available at https://github.com/ds‑kiel/ThinkingViT.
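
The progressive inference loop described above can be sketched as a simple early‑exit procedure. This is an illustrative outline only: the stage interface (a callable taking the input and the previous stage's embedding and returning a label, a confidence, and a new embedding) and the name `progressive_inference` are assumptions made for this example, and the actual model runs nested subsets of attention heads inside one backbone rather than separate callables.

```python
def progressive_inference(stages, x, threshold):
    """Run stages from smallest to largest and stop once confident.

    stages: non-empty list of callables (x, prev_embedding) ->
            (label, confidence, embedding), ordered by capacity.
    Returns (label, confidence, number_of_stages_used).
    """
    if not stages:
        raise ValueError("at least one stage is required")
    prev_embedding = None  # nothing to recycle before the first stage
    for used, stage in enumerate(stages, start=1):
        # Hand over the previous embedding so a stage can reuse it
        # (a stand-in for the Token Recycling idea).
        label, confidence, prev_embedding = stage(x, prev_embedding)
        if confidence > threshold:
            break  # confident enough: terminate early
    return label, confidence, used
```

If no stage exceeds the threshold, the prediction of the last (largest) stage is returned, matching the "exhausts its maximum capacity" case.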

Abstract:
Prosthetic legs play a pivotal role in clinical rehabilitation, allowing individuals with lower‑limb amputations to regain mobility and improve their quality of life. Gait analysis is fundamental for optimizing prosthesis design and alignment, directly impacting the mobility and life quality of individuals with lower‑limb amputations. Vision‑based machine learning (ML) methods offer a scalable and non‑invasive solution to gait analysis, but face challenges in correctly detecting and analyzing prostheses, due to their unique appearances and new movement patterns. In this paper, we aim to bridge this gap by introducing a multi‑purpose dataset, namely ProGait, to support multiple vision tasks including Video Object Segmentation, 2D Human Pose Estimation, and Gait Analysis (GA). ProGait provides 412 video clips from four above‑knee amputees testing multiple newly‑fitted prosthetic legs through walking trials, and depicts the presence, contours, poses, and gait patterns of human subjects with transfemoral prosthetic legs. Alongside the dataset itself, we also present benchmark tasks and fine‑tuned baseline models to illustrate the practical application and performance of the ProGait dataset. We compared our baseline models against pre‑trained vision models, demonstrating improved generalizability when applying the ProGait dataset for prosthesis‑specific tasks. Our code is available at https://github.com/pittisl/ProGait and dataset at https://huggingface.co/datasets/ericyxy98/ProGait.

Abstract:
Pixel‑level annotation is expensive and time‑consuming. Semi‑supervised segmentation methods address this challenge by learning models on few labeled images alongside a large corpus of unlabeled images. Although foundation models could further account for label scarcity, effective mechanisms for their exploitation remain underexplored. We address this by devising a novel semi‑supervised panoptic approach fueled by two dedicated foundation models. We enhance recognition by complementing unsupervised mask‑transformer consistency with zero‑shot classification of CLIP features. We enhance localization by class‑agnostic decoder warm‑up with respect to SAM pseudo‑labels. The resulting decoupled enhancement of recognition and localization (DEARLi) particularly excels in the most challenging semi‑supervised scenarios with large taxonomies and limited labeled data. Moreover, DEARLi outperforms the state of the art in semi‑supervised semantic segmentation by a large margin while requiring 8x less GPU memory, in spite of being trained only for the panoptic objective. We observe 29.9 PQ and 38.9 mIoU on ADE20K with only 158 labeled images. The source code is available at https://github.com/helen1c/DEARLi.
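
Since mIoU is the main semantic segmentation metric quoted here and throughout this list, a minimal sketch of how it is commonly computed may help. It assumes flattened integer label maps and skips classes that appear in neither the prediction nor the ground truth; benchmark toolkits usually accumulate a confusion matrix over the whole dataset and handle ignore labels, which this example omits. The function name `mean_iou` is an assumption made for this example.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes for flattened label maps."""
    if len(pred) != len(target):
        raise ValueError("label maps must have the same number of elements")
    ious = []
    for c in range(num_classes):
        intersection = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # skip classes absent from both prediction and ground truth
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0
```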

Abstract:
We present MoVieS, a Motion‑aware View Synthesis model that reconstructs 4D dynamic scenes from monocular videos in one second. It represents dynamic 3D scenes with pixel‑aligned Gaussian primitives and explicitly supervises their time‑varying motions. This allows, for the first time, the unified modeling of appearance, geometry and motion from monocular videos, and enables reconstruction, view synthesis and 3D point tracking within a single learning‑based framework. By bridging view synthesis with geometry reconstruction, MoVieS enables large‑scale training on diverse datasets with minimal dependence on task‑specific supervision. As a result, it also naturally supports a wide range of zero‑shot applications, such as scene flow estimation and moving object segmentation. Extensive experiments validate the effectiveness and efficiency of MoVieS across multiple tasks, achieving competitive performance while offering several orders of magnitude speedups.

Abstract:
Weakly‑supervised semantic segmentation aims to assign category labels to each pixel using weak annotations, significantly reducing manual annotation costs. Although existing methods have achieved remarkable progress in well‑lit scenarios, their performance significantly degrades in low‑light environments due to two fundamental limitations: severe image quality degradation (e.g., low contrast, noise, and color distortion) and the inherent constraints of weak supervision. These factors collectively lead to unreliable class activation maps and semantically ambiguous pseudo‑labels, ultimately compromising the model's ability to learn discriminative feature representations. To address these problems, we propose Diffusion‑Guided Knowledge Distillation for Weakly‑Supervised Low‑light Semantic Segmentation (DGKD‑WLSS), a novel framework that synergistically combines Diffusion‑Guided Knowledge Distillation (DGKD) with Depth‑Guided Feature Fusion (DGF2). DGKD aligns normal‑light and low‑light features via diffusion‑based denoising and knowledge distillation, while DGF2 integrates depth maps as illumination‑invariant geometric priors to enhance structural feature learning. Extensive experiments demonstrate the effectiveness of DGKD‑WLSS, which achieves state‑of‑the‑art performance in weakly supervised semantic segmentation tasks under low‑light conditions. The source code has been released at: https://github.com/ChunyanWang1/DGKD‑WLSS.

Abstract:
Ultrasound imaging is a prevalent diagnostic tool known for its simplicity and non‑invasiveness. However, its inherent characteristics often introduce substantial noise, posing considerable challenges for automated lesion or organ segmentation in ultrasound video sequences. To address these limitations, we propose the Dual Semantic‑Aware Network (DSANet), a novel framework designed to enhance noise robustness in ultrasound video segmentation by fostering mutual semantic awareness between local and global features. Specifically, we introduce an Adjacent‑Frame Semantic‑Aware (AFSA) module, which constructs a channel‑wise similarity matrix to guide feature fusion across adjacent frames, effectively mitigating the impact of random noise without relying on pixel‑level relationships. Additionally, we propose a Local‑and‑Global Semantic‑Aware (LGSA) module that reorganizes and fuses temporal unconditional local features, which capture spatial details independently at each frame, with conditional global features that incorporate temporal context from adjacent frames. This integration facilitates multi‑level semantic representation, significantly improving the model's resilience to noise interference. Extensive evaluations on four benchmark datasets demonstrate that DSANet substantially outperforms state‑of‑the‑art methods in segmentation accuracy. Moreover, since our model avoids pixel‑level feature dependencies, it achieves significantly higher inference FPS than video‑based methods, and even surpasses some image‑based models. Code can be found at https://github.com/ZhouL2001/DSANet.

Abstract:
Unsupervised domain adaptation (UDA) involves learning class semantics from labeled data within a source domain that generalize to an unseen target domain. UDA methods are particularly impactful for semantic segmentation, where annotations are more difficult to collect than in image classification. Despite recent advances in large‑scale vision‑language representation learning, UDA methods for segmentation have not taken advantage of the domain‑agnostic properties of text. To address this, we present a novel Covariance‑based Pixel‑Text loss, CoPT, that uses domain‑agnostic text embeddings to learn domain‑invariant features in an image segmentation encoder. The text embeddings are generated through our LLM Domain Template process, where an LLM is used to generate source and target domain descriptions that are fed to a frozen CLIP model and combined. In experiments on four benchmarks we show that a model trained using CoPT achieves the new state of the art performance on UDA for segmentation. The code can be found at https://github.com/cfmata/CoPT.

Abstract:
This paper proposes an adaptive margin contrastive learning method for 3D semantic segmentation on point clouds. Most existing methods use equally penalized objectives, which ignore the per‑point ambiguities and less discriminative features stemming from transition regions. However, as highly ambiguous points may be indistinguishable even for humans, their manually annotated labels are less reliable, and hard constraints over these points would lead to sub‑optimal models. To address this, we first design AMContrast3D, a method that integrates contrastive learning into an ambiguity estimation framework, tailored to adaptive objectives for individual points based on ambiguity levels. As a result, our method promotes model training that ensures the correctness of low‑ambiguity points while allowing mistakes for high‑ambiguity points. As ambiguities are formulated based on position discrepancies across labels, optimization during inference is constrained by the assumption that all unlabeled points are uniformly unambiguous, lacking ambiguity awareness. Inspired by the insight of joint training, we further propose AMContrast3D++, which integrates two branches trained in parallel, where a novel ambiguity prediction module concurrently learns point ambiguities from generated embeddings. To this end, we design a masked refinement mechanism that leverages predicted ambiguities to make the ambiguous embeddings more reliable, thereby boosting segmentation performance and enhancing robustness. Experimental results on 3D indoor scene datasets, S3DIS and ScanNet, demonstrate the effectiveness of the proposed method. Code is available at https://github.com/YangChenApril/AMContrast3D.

Abstract:
While large multi‑modal models (LMMs) demonstrate promising capabilities in segmentation and comprehension, they still struggle with two limitations: inaccurate segmentation and hallucinated comprehension. These challenges stem primarily from constraints in weak visual comprehension and a lack of fine‑grained perception. To alleviate these limitations, we propose LIRA, a framework that capitalizes on the complementary relationship between visual comprehension and segmentation via two key components: (1) Semantic‑Enhanced Feature Extractor (SEFE) improves object attribute inference by fusing semantic and pixel‑level features, leading to more accurate segmentation; (2) Interleaved Local Visual Coupling (ILVC) autoregressively generates local descriptions after extracting local features based on segmentation masks, offering fine‑grained supervision to mitigate hallucinations. Furthermore, we find that the precision of object segmentation is positively correlated with the latent related semantics of the <seg> token. To quantify this relationship and the model's potential semantic inferring ability, we introduce the Attributes Evaluation (AttrEval) dataset. Our experiments show that LIRA achieves state‑of‑the‑art performance in both segmentation and comprehension tasks. Code will be available at https://github.com/echo840/LIRA.

Abstract:
Deep neural networks for semantic segmentation rely on large‑scale annotated datasets, leading to an annotation bottleneck that motivates few‑shot semantic segmentation (FSS), which aims to generalize to novel classes with minimal labeled exemplars. Most existing FSS methods adopt a prototype‑based paradigm, which generates a query prior map by extracting masked‑area features from support images and then makes predictions guided by the prior map. However, they suffer from two critical limitations induced by inter‑ and intra‑image discrepancies: 1) The intra‑class gap between support and query images, caused by single‑prototype representation, results in scattered and noisy prior maps; 2) The inter‑class interference from visually similar but semantically distinct regions leads to inconsistent support‑query feature matching and erroneous predictions. To address these issues, we propose the Inter‑ and Intra‑image Refinement (IIR) model. The model contains an inter‑image class activation mapping based method that generates two prototypes for class‑consistent region matching, including core discriminative features and local specific features, and yields an accurate and robust prior map. For intra‑image refinement, a directional dropout mechanism is introduced to mask inconsistent support‑query feature pairs in cross attention, thereby enhancing decoder performance. Extensive experiments demonstrate that IIR achieves state‑of‑the‑art performance on 9 benchmarks, covering standard FSS, part FSS, and cross‑domain FSS. Our source code is available at https://github.com/forypipi/IIR.

Abstract:
LiDAR representation learning aims to extract rich structural and semantic information from large‑scale, readily available datasets, reducing reliance on costly human annotations. However, existing LiDAR representation strategies often overlook the inherent spatiotemporal cues in LiDAR sequences, limiting their effectiveness. In this work, we propose LiMA, a novel long‑term image‑to‑LiDAR Memory Aggregation framework that explicitly captures longer range temporal correlations to enhance LiDAR representation learning. LiMA comprises three key components: 1) a Cross‑View Aggregation module that aligns and fuses overlapping regions across neighboring camera views, constructing a more unified and redundancy‑free memory bank; 2) a Long‑Term Feature Propagation mechanism that efficiently aligns and integrates multi‑frame image features, reinforcing temporal coherence during LiDAR representation learning; and 3) a Cross‑Sequence Memory Alignment strategy that enforces consistency across driving sequences, improving generalization to unseen environments. LiMA maintains high pretraining efficiency and incurs no additional computational overhead during downstream tasks. Extensive experiments on mainstream LiDAR‑based perception benchmarks demonstrate that LiMA significantly improves both LiDAR semantic segmentation and 3D object detection. We hope this work inspires more effective pretraining paradigms for autonomous driving. The code has been made publicly accessible for future research.

Abstract:
Bird's‑Eye‑View (BEV) semantic segmentation is an indispensable perception task in end‑to‑end autonomous driving systems. Unsupervised and semi‑supervised learning for BEV tasks, though pivotal for real‑world applications, underperform due to the homogeneous distribution of the labeled data. In this work, we explore the potential of synthetic data from driving world models to enhance the diversity of labeled data for robustifying BEV segmentation. Yet, our preliminary findings reveal that generation noise in synthetic data compromises efficient BEV model learning. To fully harness the potential of synthetic data from world models, this paper proposes NRSeg, a noise‑resilient learning framework for BEV semantic segmentation. Specifically, a Perspective‑Geometry Consistency Metric (PGCM) is proposed to quantitatively evaluate the guidance capability of generated data for model learning. This metric originates from the alignment measure between the perspective road mask of generated data and the mask projected from the BEV labels. Moreover, a Bi‑Distribution Parallel Prediction (BiDPP) is designed to enhance the inherent robustness of the model, where the learning process is constrained through parallel prediction of multinomial and Dirichlet distributions. The former efficiently predicts semantic probabilities, whereas the latter adopts evidential deep learning to realize uncertainty quantification. Furthermore, a Hierarchical Local Semantic Exclusion (HLSE) module is designed to address the non‑mutual exclusivity inherent in BEV semantic segmentation tasks. Experimental results demonstrate that NRSeg achieves state‑of‑the‑art performance, yielding the highest improvements in mIoU of 13.8% and 11.4% in unsupervised and semi‑supervised BEV segmentation tasks, respectively. The source code will be made publicly available at https://github.com/lynn‑yu/NRSeg.
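
The Dirichlet branch mentioned above can be illustrated with the common evidential deep learning parameterization. This is a minimal sketch, not NRSeg's implementation: it assumes non-negative per-class evidence with alpha = evidence + 1, and the function name `dirichlet_summary` is hypothetical.

```python
def dirichlet_summary(evidence):
    """Return (expected class probabilities, uncertainty) for one pixel.

    Assumption: the common evidential formulation alpha_k = evidence_k + 1,
    with uncertainty u = K / sum(alpha). The abstract does not state which
    formulation NRSeg uses.
    """
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)                     # total Dirichlet strength S
    probs = [a / strength for a in alpha]     # mean of the Dirichlet
    uncertainty = len(alpha) / strength       # high when little evidence exists
    return probs, uncertainty


# A pixel with strong evidence for class 0 versus a pixel with almost none.
confident_probs, confident_u = dirichlet_summary([40.0, 1.0, 1.0])
vague_probs, vague_u = dirichlet_summary([0.2, 0.1, 0.1])
```

Under this parameterization, a noisy synthetic pixel that gathers little evidence for any class receives an uncertainty near 1, which is the kind of signal a noise-resilient loss could use for down-weighting.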

Abstract:
In semi‑supervised semantic segmentation, existing studies have shown promising results in academic settings with controlled splits of benchmark datasets. However, the potential benefits of leveraging significantly larger sets of unlabeled images remain unexplored. In real‑world scenarios, abundant unlabeled images are often available from online sources (web‑scraped images) or large‑scale datasets. However, these images may have different distributions from those of the target dataset, a situation known as out‑of‑distribution (OOD). Using these images as unlabeled data in semi‑supervised learning can lead to inaccurate pseudo‑labels, potentially misguiding network training. In this paper, we propose a new semi‑supervised semantic segmentation framework with an open‑vocabulary segmentation model (SemiOVS) to effectively utilize unlabeled OOD images. Extensive experiments on Pascal VOC and Context datasets demonstrate two key findings: (1) using additional unlabeled images improves the performance of semi‑supervised learners in scenarios with few labels, and (2) using the open‑vocabulary segmentation (OVS) model to pseudo‑label OOD images leads to substantial performance gains. In particular, SemiOVS outperforms existing PrevMatch and SemiVL methods by +3.5 and +3.0 mIoU, respectively, on Pascal VOC with a 92‑label setting, achieving state‑of‑the‑art performance. These findings demonstrate that our approach effectively utilizes abundant unlabeled OOD images for semantic segmentation tasks. We hope this work can inspire future research and real‑world applications. The code is available at https://github.com/wooseok‑shin/SemiOVS

Abstract:
Medical image recognition serves as a key way to aid in clinical diagnosis, enabling more accurate and timely identification of diseases and abnormalities. Vision transformer‑based approaches have proven effective in handling various medical recognition tasks. However, these methods encounter two primary challenges. First, they are often task‑specific and architecture‑tailored, limiting their general applicability. Second, they usually either adopt full attention to model long‑range dependencies, resulting in high computational costs, or rely on handcrafted sparse attention, potentially leading to suboptimal performance. To tackle these issues, we present MedFormer, an efficient medical vision transformer with two key ideas. First, it employs a pyramid scaling structure as a versatile backbone for various medical image recognition tasks, including image classification and dense prediction tasks such as semantic segmentation and lesion detection. This structure facilitates hierarchical feature representation while reducing the computation load of feature maps, highly beneficial for boosting performance. Second, it introduces a novel Dual Sparse Selection Attention (DSSA) with content awareness to improve computational efficiency and robustness against noise while maintaining high performance. As the core building technique of MedFormer, DSSA is designed to explicitly attend to the most relevant content. Theoretical analysis demonstrates that MedFormer outperforms existing medical vision transformers in terms of generality and efficiency. Extensive experiments across various imaging modality datasets show that MedFormer consistently enhances performance in all three medical image recognition tasks mentioned above. MedFormer provides an efficient and versatile solution for medical image recognition, with strong potential for clinical application.

Abstract:
As the core operator of Transformers, Softmax Attention exhibits excellent global modeling capabilities. However, its quadratic complexity limits its applicability to vision tasks. In contrast, Linear Attention shares a similar formulation with Softmax Attention while achieving linear complexity, enabling efficient global information modeling. Nevertheless, Linear Attention suffers from a significant performance degradation compared to standard Softmax Attention. In this paper, we analyze the underlying causes of this issue based on the formulation of Linear Attention. We find that, unlike Softmax Attention, Linear Attention entirely disregards the magnitude information of the Query. This prevents the attention score distribution from dynamically adapting as the Query scales. As a result, despite its structural similarity to Softmax Attention, Linear Attention exhibits a significantly different attention score distribution. Based on this observation, we propose Magnitude‑Aware Linear Attention (MALA), which modifies the computation of Linear Attention to fully incorporate the Query's magnitude. This adjustment allows MALA to generate an attention score distribution that closely resembles Softmax Attention while exhibiting a more well‑balanced structure. We evaluate the effectiveness of MALA on multiple tasks, including image classification, object detection, instance segmentation, semantic segmentation, natural language processing, speech recognition, and image generation. Our MALA achieves strong results on all of these tasks. Code will be available at https://github.com/qhfan/MALA
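
The observation that Linear Attention ignores the Query's magnitude can be checked numerically. The sketch below is an illustration rather than the MALA code: it assumes a ReLU feature map (one common, positively homogeneous choice) and uses hypothetical helper names. It does not implement MALA's proposed fix.

```python
import math


def softmax_attention_weights(q, keys):
    """Softmax attention weights of one query over all keys."""
    logits = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    top = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def relu(v):
    return [max(x, 0.0) for x in v]


def linear_attention_weights(q, keys):
    """Normalized kernel weights phi(q).phi(k_i) / sum_j phi(q).phi(k_j), phi = ReLU."""
    fq = relu(q)
    scores = [sum(a * b for a, b in zip(fq, relu(k))) for k in keys]
    total = sum(scores)
    return [s / total for s in scores]


q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
q_scaled = [4.0 * x for x in q]             # same direction, larger magnitude

lin = linear_attention_weights(q, keys)
lin_scaled = linear_attention_weights(q_scaled, keys)
soft = softmax_attention_weights(q, keys)
soft_scaled = softmax_attention_weights(q_scaled, keys)
```

Scaling the query cancels between numerator and denominator in the linear case, so `lin` and `lin_scaled` agree, whereas the softmax weights concentrate further on the best-matching key as the query grows.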

Abstract:
Open‑vocabulary semantic segmentation (OVSS) aims to segment objects from arbitrary text categories without requiring densely annotated datasets. Although contrastive learning based models enable zero‑shot segmentation, they often lose fine spatial precision at pixel level, due to global representation bias. In contrast, diffusion‑based models naturally encode fine‑grained spatial features via attention mechanisms that capture both global context and local details. However, they often face challenges in balancing the computation costs and the quality of the segmentation mask. In this work, we present FA‑Seg, a Fast and Accurate training‑free framework for open‑vocabulary segmentation based on diffusion models. FA‑Seg performs segmentation using only a (1+1)‑step from a pretrained diffusion model. Moreover, instead of running multiple times for different classes, FA‑Seg performs segmentation for all classes at once. To further enhance the segmentation quality, FA‑Seg introduces three key components: (i) a dual‑prompt mechanism for discriminative, class‑aware attention extraction, (ii) a Hierarchical Attention Refinement Method (HARD) that enhances semantic precision via multi‑resolution attention fusion, and (iii) a Test‑Time Flipping (TTF) scheme designed to improve spatial consistency. Extensive experiments show that FA‑Seg achieves state‑of‑the‑art training‑free performance, obtaining 43.8% average mIoU across PASCAL VOC, PASCAL Context, and COCO Object benchmarks while maintaining superior inference efficiency. Our results demonstrate that FA‑Seg provides a strong foundation for extendability, bridging the gap between segmentation quality and inference efficiency. The source code is available at https://github.com/chequanghuy/FA‑Seg.
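
Flip-based test-time averaging, the general idea behind a scheme such as TTF, can be sketched as follows. The abstract does not describe FA‑Seg's exact procedure, so this shows only generic horizontal-flip averaging; `toy_predict` is a hypothetical stand-in for a model with a left-to-right positional bias.

```python
def hflip(grid):
    """Mirror a 2D grid (list of rows) left to right."""
    return [row[::-1] for row in grid]


def flip_averaged(predict, image):
    """Average the prediction on the image with the un-flipped prediction on its mirror."""
    direct = predict(image)
    flipped_back = hflip(predict(hflip(image)))
    return [[(a + b) / 2.0 for a, b in zip(r1, r2)]
            for r1, r2 in zip(direct, flipped_back)]


def toy_predict(image):
    """Hypothetical model: true score plus a bias that grows with the column index."""
    return [[v + 0.1 * x for x, v in enumerate(row)] for row in image]


image = [[0.0, 1.0, 0.0],
         [1.0, 0.0, 1.0]]
fused = flip_averaged(toy_predict, image)
```

In this toy case the column-dependent bias turns into a constant offset after averaging, so the fused map no longer favors one side of the image.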

Abstract:
This paper investigates indoor point cloud semantic segmentation under scene‑level annotation, which is less explored compared to methods relying on sparse point‑level labels. In the absence of precise point‑level labels, current methods first generate point‑level pseudo‑labels, which are then used to train segmentation models. However, generating accurate pseudo‑labels for each point solely based on scene‑level annotations poses a considerable challenge, substantially affecting segmentation performance. Consequently, to enhance accuracy, this paper proposes a high‑quality pseudo‑label generation framework by exploring contemporary multi‑modal information and region‑point semantic consistency. Specifically, with a cross‑modal feature guidance module, our method utilizes 2D‑3D correspondences to align point cloud features with corresponding 2D image pixels, thereby assisting point cloud feature learning. To further alleviate the challenge presented by the scene‑level annotation, we introduce a region‑point semantic consistency module. It produces regional semantics through a region‑voting strategy derived from point‑level semantics, which are subsequently employed to guide the point‑level semantic predictions. Leveraging the aforementioned modules, our method can rectify inaccurate point‑level semantic predictions during training and obtain high‑quality pseudo‑labels. Significant improvements over previous works on the ScanNet v2 and S3DIS datasets under scene‑level annotation demonstrate the effectiveness of our method. Additionally, comprehensive ablation studies validate the contributions of our approach's individual components. The code is available at https://github.com/LHDuan/WSegPC .
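
A region-voting step of the kind described above might look like the sketch below. It assumes a hard majority vote over precomputed region ids, which the abstract does not confirm; the function name `region_vote` and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def region_vote(point_labels, region_ids):
    """Replace each point's label with the majority label of its region.

    Assumption: regions are given as one integer id per point. Ties resolve
    to the label seen first within the region.
    """
    votes = defaultdict(Counter)
    for label, region in zip(point_labels, region_ids):
        votes[region][label] += 1
    region_label = {r: c.most_common(1)[0][0] for r, c in votes.items()}
    return [region_label[r] for r in region_ids]


# Seven points in two regions; one point per region disagrees with its neighbors.
point_labels = ["chair", "chair", "table", "chair", "floor", "floor", "wall"]
region_ids = [0, 0, 0, 0, 1, 1, 1]
refined = region_vote(point_labels, region_ids)
```

Here the isolated "table" and "wall" predictions are overruled by their regions. A softer variant could use the regional label as a training target rather than overwriting the point prediction.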

Abstract:
Generalized Few‑Shot Semantic Segmentation (GFSS) aims to extend a segmentation model to novel classes with only a few annotated examples while maintaining performance on base classes. Recently, pretrained vision‑language models (VLMs) such as CLIP have been leveraged in GFSS to improve generalization on novel classes through multi‑modal prototypes learning. However, existing prototype‑based methods are inherently deterministic, limiting the adaptability of learned prototypes to diverse samples, particularly for novel classes with scarce annotations. To address this, we propose FewCLIP, a probabilistic prototype calibration framework over multi‑modal prototypes from the pretrained CLIP, thus providing more adaptive prototype learning for GFSS. Specifically, FewCLIP first introduces a prototype calibration mechanism, which refines frozen textual prototypes with learnable visual calibration prototypes, leading to a more discriminative and adaptive representation. Furthermore, unlike deterministic prototype learning techniques, FewCLIP introduces distribution regularization over these calibration prototypes. This probabilistic formulation ensures structured and uncertainty‑aware prototype learning, effectively mitigating overfitting to limited novel class data while enhancing generalization. Extensive experimental results on PASCAL‑5^i and COCO‑20^i datasets demonstrate that our proposed FewCLIP significantly outperforms state‑of‑the‑art approaches across both GFSS and class‑incremental setting. The code is available at https://github.com/jliu4ai/FewCLIP.

Abstract:
Existing video segmenter and grounder approaches, exemplified by Sa2VA, directly fuse features within segmentation models. This often results in an undesirable entanglement of dynamic visual information and static semantics, thereby degrading segmentation accuracy. To systematically mitigate this issue, we propose DeSa2VA, a decoupling‑enhanced prompting scheme integrating text pre‑training and a linear decoupling module to address the information processing limitations inherent in SAM‑2. Specifically, first, we devise a pre‑training paradigm that converts textual ground‑truth labels into point‑level prompts while generating corresponding text masks. These masks are refined through a hybrid loss function to strengthen the model's semantic grounding capabilities. Next, we employ linear projection to disentangle hidden states generated by a large language model into distinct textual and visual feature subspaces. Finally, a dynamic mask fusion strategy synergistically combines these decoupled features through triple supervision from predicted text/visual masks and ground‑truth annotations. Extensive experiments demonstrate state‑of‑the‑art performance across diverse tasks, including image segmentation, image question answering, video segmentation, and video question answering. Our code is available at https://github.com/longmalongma/DeSa2VA.

Abstract:
3D Gaussian Splatting (3DGS) has become a workhorse for high‑quality, real‑time rendering in novel view synthesis of 3D scenes. However, existing methods focus primarily on geometric and appearance modeling, lacking deeper scene understanding while also incurring high training costs that complicate the originally streamlined differentiable rendering pipeline. To this end, we propose VoteSplat, a novel 3D scene understanding framework that integrates Hough voting with 3DGS. Specifically, the Segment Anything Model (SAM) is utilized for instance segmentation, extracting objects, and generating 2D vote maps. We then embed spatial offset vectors into Gaussian primitives. These offsets construct 3D spatial votes by associating them with 2D image votes, while depth distortion constraints refine localization along the depth axis. For open‑vocabulary object localization, VoteSplat maps 2D image semantics to 3D point clouds via voting points, reducing training costs associated with high‑dimensional CLIP features while preserving semantic unambiguity. Extensive experiments demonstrate the effectiveness of VoteSplat in open‑vocabulary 3D instance localization, 3D point cloud understanding, click‑based 3D object localization, hierarchical segmentation, and ablation studies. Our code is available at https://sy‑ja.github.io/votesplat/

Abstract:
As a computer vision task, automatic object segmentation remains challenging in specialized image domains without massive labeled data, such as synthetic aperture sonar images, remote sensing, biomedical imaging, etc. In any domain, obtaining pixel‑wise segmentation masks is expensive. In this work, we propose a method for training a masking network to perform binary object segmentation using weak supervision in the form of image‑wise presence or absence of an object of interest, which provides less information but may be obtained more quickly from manual or automatic labeling. A key step in our method is that the segmented objects can be placed into background‑only images to create realistic images of the objects with counterfactual backgrounds. To create a contrast between the original and counterfactual background images, we propose to first cluster the background‑only images and then, during learning, create counterfactual images that blend objects segmented from their original source backgrounds to backgrounds chosen from a targeted cluster. One term in the training loss is the divergence between these counterfactual images and the real object images with backgrounds of the target cluster. The other term is a supervised loss for background‑only images. While an adversarial critic could provide the divergence, we use sample‑based divergences. We conduct experiments on side‑scan and synthetic aperture sonar in which our approach succeeds compared to previous unsupervised segmentation baselines that were only tested on natural images. Furthermore, to show generality we extend our experiments to natural images, obtaining reasonable performance with our method that avoids pretrained networks, generative networks, and adversarial critics. The code for this work can be found on GitHub at https://github.com/bakerhassan/WSOS.

Abstract:
Automatically segmenting objects from optical remote sensing images (ORSIs) is an important task. Most existing models are primarily based on either convolutional or Transformer features, each offering distinct advantages. Exploiting both advantages is a valuable research direction, but it presents several challenges, including the heterogeneity between the two types of features, high complexity, and the large number of model parameters. However, these issues are often overlooked in existing ORSI methods, causing sub‑optimal segmentation. To address this, we propose a novel Dual‑Perspective United Transformer (DPU‑Former) with a unique structure designed to simultaneously integrate long‑range dependencies and spatial details. In particular, we design the global‑local mixed attention, which captures diverse information through two perspectives and introduces a Fourier‑space merging strategy to obviate deviations for efficient fusion. Furthermore, we present a gated linear feed‑forward network to increase the expressive ability. Additionally, we construct a DPU‑Former decoder to aggregate and strengthen features at different layers. Consequently, the DPU‑Former model outperforms the state‑of‑the‑art methods on multiple datasets. Code: https://github.com/CSYSI/DPU‑Former.

Abstract:
Training‑free open‑vocabulary semantic segmentation (OVS) aims to segment images given a set of arbitrary textual categories without costly model fine‑tuning. Existing solutions often explore attention mechanisms of pre‑trained models, such as CLIP, or generate synthetic data and design complex retrieval processes to perform OVS. However, their performance is limited by the capability of reliant models or the suboptimal quality of reference sets. In this work, we investigate the largely overlooked data quality problem for this challenging dense scene understanding task, and identify that a high‑quality reference set can significantly benefit training‑free OVS. With this observation, we introduce a data‑quality‑oriented framework, comprising a data pipeline to construct a reference set with well‑paired segment‑text embeddings and a simple similarity‑based retrieval to unveil the essential effect of data. Remarkably, extensive evaluations on ten benchmark datasets demonstrate that our method outperforms all existing training‑free OVS approaches, highlighting the importance of data‑centric design for advancing OVS without training. Our code is available at https://github.com/xiweix/ReME .

Abstract:
Commercial RGB‑D cameras often produce noisy, incomplete depth maps for non‑Lambertian objects. Traditional depth completion methods struggle to generalize due to the limited diversity and scale of training data. Recent advances exploit visual priors from pre‑trained text‑to‑image diffusion models to enhance generalization in dense prediction tasks. However, we find that biases arising from training‑inference mismatches in the vanilla diffusion framework significantly impair depth completion performance. Additionally, the lack of distinct visual features in non‑Lambertian regions further hinders precise prediction. To address these issues, we propose DidSee, a diffusion‑based framework for depth completion on non‑Lambertian objects. First, we integrate a rescaled noise scheduler enforcing a zero terminal signal‑to‑noise ratio to eliminate signal leakage bias. Second, we devise a noise‑agnostic single‑step training formulation to alleviate error accumulation caused by exposure bias and optimize the model with a task‑specific loss. Finally, we incorporate a semantic enhancer that enables joint depth completion and semantic segmentation, distinguishing objects from backgrounds and yielding precise, fine‑grained depth maps. DidSee achieves state‑of‑the‑art performance on multiple benchmarks, demonstrates robust real‑world generalization, and effectively improves downstream tasks such as category‑level pose estimation and robotic grasping.

Abstract:
3D instance segmentation aims to predict a set of object instances in a scene, representing them as binary foreground masks with corresponding semantic labels. Currently, transformer‑based methods are gaining increasing attention due to their elegant pipelines and superior predictions. However, these methods primarily focus on modeling the external relationships between scene features and query features through mask attention. They lack effective modeling of the internal relationships among scene features as well as between query features. In light of these disadvantages, we propose Relation3D: Enhancing Relation Modeling for Point Cloud Instance Segmentation. Specifically, we introduce an adaptive superpoint aggregation module and a contrastive learning‑guided superpoint refinement module to better represent superpoint features (scene features) and leverage contrastive learning to guide the updates of these features. Furthermore, our relation‑aware self‑attention mechanism enhances the capabilities of modeling relationships between queries by incorporating positional and geometric relationships into the self‑attention mechanism. Extensive experiments on the ScanNetV2, ScanNet++, ScanNet200 and S3DIS datasets demonstrate the superior performance of Relation3D.

Abstract:
The integration of RGB and thermal data can significantly improve semantic segmentation performance in wild environments for field robots. Nevertheless, multi‑source data processing (e.g., Transformer‑based approaches) imposes significant computational overhead, presenting challenges for resource‑constrained systems. To resolve this critical limitation, we introduce CM‑SSM, an efficient RGB‑thermal semantic segmentation architecture leveraging a cross‑modal state space modeling (SSM) approach. Our framework comprises two key components. First, we introduce a cross‑modal 2D‑selective‑scan (CM‑SS2D) module to establish SSM between RGB and thermal modalities, which constructs cross‑modal visual sequences and derives hidden state representations of one modality from the other. Second, we develop a cross‑modal state space association (CM‑SSA) module that effectively integrates global associations from CM‑SS2D with local spatial features extracted through convolutional operations. In contrast with Transformer‑based approaches, CM‑SSM achieves linear computational complexity with respect to image resolution. Experimental results show that CM‑SSM achieves state‑of‑the‑art performance on the CART dataset with fewer parameters and lower computational cost. Further experiments on the PST900 dataset demonstrate its generalizability. Code is available at https://github.com/xiaodonguo/CMSSM.

Abstract:
Medical image analysis is critical yet challenged by the need to jointly segment organs or tissues and numerous instances for anatomical structures and tumor microenvironment analysis. Existing studies typically formulate different segmentation tasks in isolation, which overlooks the fundamental interdependencies between these tasks, leading to suboptimal segmentation performance and insufficient medical image understanding. To address this issue, we propose a Co‑Seg++ framework for versatile medical segmentation. Specifically, we introduce a novel co‑segmentation paradigm, allowing semantic and instance segmentation tasks to mutually enhance each other. We first devise a spatio‑sequential prompt encoder (SSP‑Encoder) to capture long‑range spatial and sequential relationships between segmentation regions and image embeddings as prior spatial constraints. Moreover, we devise a multi‑task collaborative decoder (MTC‑Decoder) that leverages cross‑guidance to strengthen the contextual consistency of both tasks, jointly computing semantic and instance segmentation masks. Extensive experiments on diverse CT and histopathology datasets demonstrate that the proposed Co‑Seg++ outperforms state‑of‑the‑art methods in the semantic, instance, and panoptic segmentation of dental anatomical structures, histopathology tissues, and nuclei instances. The source code is available at https://github.com/xq141839/Co‑Seg‑Plus.

Abstract:
The segmentation of forest LiDAR 3D point clouds, including both individual tree and semantic segmentation, is fundamental for advancing forest management and ecological research. However, current approaches often struggle with the complexity and variability of natural forest environments. We present ForestFormer3D, a new unified and end‑to‑end framework designed for precise individual tree and semantic segmentation. ForestFormer3D incorporates ISA‑guided query point selection, a score‑based block merging strategy during inference, and a one‑to‑many association mechanism for effective training. By combining these new components, our model achieves state‑of‑the‑art performance for individual tree segmentation on the newly introduced FOR‑instanceV2 dataset, which spans diverse forest types and regions. Additionally, ForestFormer3D generalizes well to unseen test sets (Wytham woods and LAUTx), showcasing its robustness across different forest conditions and sensor modalities. The FOR‑instanceV2 dataset and the ForestFormer3D code are publicly available at https://bxiang233.github.io/FF3D/.

Abstract:
Global localization is necessary for autonomous operations on the lunar surface where traditional Earth‑based navigation infrastructure, such as GPS, is unavailable. As NASA advances toward sustained lunar presence under the Artemis program, autonomous operations will be an essential component of tasks such as robotic exploration and infrastructure deployment. Tasks such as excavation and transport of regolith require precise pose estimation, but proposed approaches such as visual‑inertial odometry (VIO) accumulate odometry drift over long traverses. Precise pose estimation is particularly important for upcoming missions such as the ISRU Pilot Excavator (IPEx) that rely on autonomous agents to operate over extended timescales and varied terrain. To help overcome odometry drift over long traverses, we propose LunarLoc, an approach to global localization that leverages instance segmentation for zero‑shot extraction of boulder landmarks from onboard stereo imagery. Segment detections are used to construct a graph‑based representation of the terrain, which is then aligned with a reference map of the environment captured during a previous session using graph‑theoretic data association. This method enables accurate and drift‑free global localization in visually ambiguous settings. LunarLoc achieves sub‑cm level accuracy in multi‑session global localization experiments, significantly outperforming the state of the art in lunar global localization. To encourage the development of further methods for global localization on the Moon, we release our datasets publicly with a playback module: https://github.com/mit‑acl/lunarloc‑data.

Abstract:
Mamba, a State Space Model (SSM) that accelerates training by recasting recurrence as a parallel scan, has recently emerged as a linearly‑scaling alternative to self‑attention. Because of its unidirectional nature, each state in Mamba only has information about its previous states and is blind to the states that follow. Current Mamba‑based computer‑vision methods typically overcome this by augmenting Mamba's global forward scan with a global backward scan, forming a bi‑directional scan to restore a full receptive field. However, this operation doubles the computational load, eroding much of the efficiency advantage that Mamba originally had. To eliminate these extra scans, we introduce LBMamba, a locally bi‑directional SSM block that embeds a lightweight locally backward scan inside the forward scan and executes it in per‑thread registers. Building on LBMamba, we present LBVim, a backbone that alternates scan directions every two layers to recover a global receptive field without extra backward sweeps. We validate our approach on both natural images and whole slide images (WSIs) and show that it consistently offers a superior performance‑throughput trade‑off. Under the same throughput, LBVim achieves 0.8% to 1.6% higher top‑1 accuracy on the ImageNet‑1K classification dataset, 0.6% to 2.7% higher mIoU on the ADE20K semantic segmentation dataset, and 0.9% higher APb and 1.1% higher APm on the COCO detection dataset. Our method also boosts the accuracy of four SOTA Mamba models, namely VMamba, LocalVim, PlainMamba and Adventurer, by 0.5% to 3.4%. We integrate LBMamba into the SOTA pathology multiple instance learning (MIL) model, MambaMIL, which is unidirectional. Experiments on 3 public WSI classification datasets show that our method achieves relative improvements of up to 3.06% in AUC, 3.39% in F1, and 1.67% in accuracy. Our code is available at https://github.com/cvlab‑stonybrook/LBMamba.

Abstract:
Global dependency modeling and spatial position modeling are two core issues of the foundational architecture design in current deep learning frameworks. Recently, Vision Transformers (ViTs) have achieved remarkable success in computer vision, leveraging the powerful global dependency modeling capability of the self‑attention mechanism. Furthermore, Mamba2 has demonstrated its significant potential in natural language processing tasks by explicitly modeling the spatial adjacency prior through the structured mask. In this paper, we propose Polyline Path Masked Attention (PPMA) that integrates the self‑attention mechanism of ViTs with an enhanced structured mask of Mamba2, harnessing the complementary strengths of both architectures. Specifically, we first ameliorate the traditional structured mask of Mamba2 by introducing a 2D polyline path scanning strategy and derive its corresponding structured mask, polyline path mask, which better preserves the adjacency relationships among image tokens. Notably, we conduct a thorough theoretical analysis on the structural characteristics of the proposed polyline path mask and design an efficient algorithm for the computation of the polyline path mask. Next, we embed the polyline path mask into the self‑attention mechanism of ViTs, enabling explicit modeling of spatial adjacency prior. Extensive experiments on standard benchmarks, including image classification, object detection, and segmentation, demonstrate that our model outperforms previous state‑of‑the‑art approaches based on both state‑space models and Transformers. For example, our proposed PPMA‑T/S/B models achieve 48.7%/51.1%/52.3% mIoU on the ADE20K semantic segmentation task, surpassing RMT‑T/S/B by 0.7%/1.3%/0.3%, respectively. Code is available at https://github.com/zhongchenzhao/PPMA.

Abstract:
In autonomous driving, high‑definition (HD) maps and semantic maps in bird's‑eye view (BEV) are essential for accurate localization, planning, and decision‑making. This paper introduces an enhanced end‑to‑end model named MapFM for online vectorized HD map generation. We significantly boost feature representation quality by incorporating a powerful foundation model for encoding camera images. To further enrich the model's understanding of the environment and improve prediction quality, we integrate auxiliary prediction heads for semantic segmentation in the BEV representation. This multi‑task learning approach provides richer contextual supervision, leading to a more comprehensive scene representation and ultimately to more accurate, higher‑quality predicted vectorized HD maps. The source code is available at https://github.com/LIvanoff/MapFM.

Abstract:
Point cloud analysis is the cornerstone of many downstream tasks, among which aggregating local structures is the basis for understanding point cloud data. While numerous works aggregate neighbors using three‑dimensional relative coordinates, they suffer from irrelevant point interference and a feature hierarchy gap due to the limitation of local coordinates. Although some works address this limitation by refining the spatial description through explicit modeling of cross‑stage structure, these enhancement methods based on direct geometric structure encoding suffer from high computational overhead and noise sensitivity. To overcome these problems, we propose the Point Distribution Set Abstraction module (PDSA), which utilizes the correlation in the high‑dimensional space to correct the feature distribution during aggregation, improving computational efficiency and robustness. PDSA distinguishes the point correlation based on a lightweight cross‑stage structural descriptor, and enhances structural homogeneity by reducing the variance of the neighbor feature matrix and increasing class separability through long‑distance modeling. Additionally, we introduce a key point mechanism to reduce the computational overhead. The experimental results on semantic segmentation and classification tasks based on different baselines verify the generalization of the proposed method, which achieves significant performance improvement with less parameter cost. The corresponding ablation and visualization results demonstrate the effectiveness and rationality of our method. The code and training weights are available at: https://github.com/AGENT9717/PointDistribution

Abstract:
Motion Object Segmentation (MOS) is crucial for autonomous driving, as it enhances localization, path planning, map construction, scene flow estimation, and future state prediction. While existing methods achieve strong performance, balancing accuracy and real‑time inference remains a challenge. To address this, we propose a logits‑based knowledge distillation framework for MOS, aiming to improve accuracy while maintaining real‑time efficiency. Specifically, we adopt a Bird's Eye View (BEV) projection‑based model as the student and a non‑projection model as the teacher. To handle the severe imbalance between moving and non‑moving classes, we decouple them and apply tailored distillation strategies, allowing the teacher model to better learn key motion‑related features. This approach significantly reduces false positives and false negatives. Additionally, we introduce dynamic upsampling, optimize the network architecture, and achieve a 7.69% reduction in parameter count, mitigating overfitting. Our method achieves a notable IoU of 78.8% on the hidden test set of the SemanticKITTI‑MOS dataset and delivers competitive results on the Apollo dataset. The KDMOS implementation is available at https://github.com/SCNU‑RISLAB/KDMOS.
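
For readers unfamiliar with logits-based knowledge distillation, the sketch below shows the standard temperature-scaled formulation: the student is trained to match the teacher's softened class distribution through a KL-divergence term. It deliberately omits the decoupling of moving and non-moving classes described above. The temperature value and function names are illustrative assumptions and do not reproduce the KDMOS loss.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    kl = sum(
        p * math.log(p / q) for p, q in zip(teacher, student) if p > 0.0
    )
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return temperature * temperature * kl


# Per-point logits for two classes (non-moving, moving); values are made up.
print(kd_loss([2.0, -1.0], [0.5, 1.5]))
```

In a segmentation setting this loss would be averaged over all points or pixels and added to the usual supervised loss on the student.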

Abstract:
We propose a novel active learning framework for multi‑view semantic segmentation. This framework relies on a new score that measures the discrepancy between point cloud distributions generated from the extra geometrical information derived from the model's prediction across different views. Our approach results in a data efficient and explainable active learning method. The source code is available at https://github.com/chilai235/viewpclAL.

Abstract:
Existing segmentation models trained on a single medical imaging dataset often lack robustness when encountering unseen organs or tumors. Developing a robust model capable of identifying rare or novel tumor categories not present during training is crucial for advancing medical imaging applications. We propose DSM, a novel framework that leverages diffusion and state space models to segment unseen tumor categories beyond the training data. DSM utilizes two sets of object queries trained within modified attention decoders to enhance classification accuracy. Initially, the model learns organ queries using an object‑aware feature grouping strategy to capture organ‑level visual features. It then refines tumor queries by focusing on diffusion‑based visual prompts, enabling precise segmentation of previously unseen tumors. Furthermore, we incorporate diffusion‑guided feature fusion to improve semantic segmentation performance. By integrating CLIP text embeddings, DSM captures category‑sensitive classes to improve linguistic transfer knowledge, thereby enhancing the model's robustness across diverse scenarios and multi‑label tasks. Extensive experiments demonstrate the superior performance of DSM in various tumor segmentation tasks. Code is available at https://github.com/Rows21/k‑Means_Mask_Mamba.

Abstract:
Instance segmentation of prohibited items in security X‑ray images is a critical yet challenging task. This is mainly caused by the significant appearance gap between prohibited items in X‑ray images and natural objects, as well as the severe overlapping among objects in X‑ray images. To address these issues, we propose an occlusion‑aware instance segmentation pipeline designed to identify prohibited items in X‑ray images. Specifically, to bridge the representation gap, we integrate the Segment Anything Model (SAM) into our pipeline, taking advantage of its rich priors and zero‑shot generalization capabilities. To address the overlap between prohibited items, we design an occlusion‑aware bilayer mask decoder module that explicitly models the occlusion relationships. To supervise occlusion estimation, we manually annotated occlusion areas of prohibited items in two large‑scale X‑ray image segmentation datasets, PIDray and PIXray. We then reorganized these additional annotations together with the original information as two occlusion‑annotated datasets, PIDray‑A and PIXray‑A. Extensive experimental results on these occlusion‑annotated datasets demonstrate the effectiveness of our proposed method. The datasets and codes are available at: https://github.com/Ryh1218/Occ

Abstract:
Learning medical visual representations from image‑report pairs through joint learning has garnered increasing research attention due to its potential to alleviate the data scarcity problem in the medical domain. The primary challenges stem from the lengthy reports that feature complex discourse relations and semantic pathologies. Previous works have predominantly focused on instance‑wise or token‑wise cross‑modal alignment, often neglecting the importance of pathological‑level consistency. This paper presents a novel framework PLACE that promotes the Pathological‑Level Alignment and enriches the fine‑grained details via Correlation Exploration without additional human annotations. Specifically, we propose a novel pathological‑level cross‑modal alignment (PCMA) approach to maximize the consistency of pathology observations from both images and reports. To facilitate this, a Visual Pathology Observation Extractor is introduced to extract visual pathological observation representations from localized tokens. The PCMA module operates independently of any external disease annotations, enhancing the generalizability and robustness of our methods. Furthermore, we design a proxy task that enforces the model to identify correlations among image patches, thereby enriching the fine‑grained details crucial for various downstream tasks. Experimental results demonstrate that our proposed framework achieves new state‑of‑the‑art performance on multiple downstream tasks, including classification, image‑to‑text retrieval, semantic segmentation, object detection and report generation. Code is available at https://github.com/Markin‑Wang/PLACE.

Abstract:
Open‑Vocabulary semantic segmentation (OVSS) and domain generalization in semantic segmentation (DGSS) highlight a subtle complementarity that motivates Open‑Vocabulary Domain‑Generalized Semantic Segmentation (OV‑DGSS). OV‑DGSS aims to generate pixel‑level masks for unseen categories while maintaining robustness across unseen domains, a critical capability for real‑world scenarios such as autonomous driving in adverse conditions. We introduce Vireo, a novel single‑stage framework for OV‑DGSS that unifies the strengths of OVSS and DGSS for the first time. Vireo builds upon the frozen Visual Foundation Models (VFMs) and incorporates scene geometry via Depth VFMs to extract domain‑invariant structural features. To bridge the gap between visual and textual modalities under domain shift, we propose three key components: (1) GeoText Prompts, which align geometric features with language cues and progressively refine VFM encoder representations; (2) Coarse Mask Prior Embedding (CMPE) for enhancing gradient flow for faster convergence and stronger textual influence; and (3) the Domain‑Open‑Vocabulary Vector Embedding Head (DOV‑VEH), which fuses refined structural and semantic features for robust prediction. Comprehensive evaluation on these components demonstrates the effectiveness of our designs. Our proposed Vireo achieves the state‑of‑the‑art performance and surpasses existing methods by a large margin in both domain generalization and open‑vocabulary recognition, offering a unified and scalable solution for robust visual understanding in diverse and dynamic environments. Code is available at https://github.com/anonymouse‑9c53tp182bvz/Vireo.

Abstract:
Cell instance segmentation is critical to analyzing biomedical images, yet accurately distinguishing tightly touching cells remains a persistent challenge. Existing instance segmentation frameworks, including detection‑based, contour‑based, and distance mapping‑based approaches, have made significant progress, but balancing model performance with computational efficiency remains an open problem. In this paper, we propose a novel cell instance segmentation method inspired by the four‑color theorem. By conceptualizing cells as countries and tissues as oceans, we introduce a four‑color encoding scheme that ensures adjacent instances receive distinct labels. This reformulation transforms instance segmentation into a constrained semantic segmentation problem with only four predicted classes, substantially simplifying the instance differentiation process. To solve the training instability caused by the non‑uniqueness of four‑color encoding, we design an asymptotic training strategy and encoding transformation method. Extensive experiments on various modes demonstrate our approach achieves state‑of‑the‑art performance. The code is available at https://github.com/zhangye‑zoe/FCIS.
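
The idea of giving adjacent instances distinct labels from a small palette can be sketched as a graph-coloring step over an instance label map. The example below builds an adjacency graph from 4-connected touching instances and colors it greedily. Greedy coloring is a swapped-in simplification: it guarantees distinct colors for neighbors but may use more than four colors on some inputs, and it is not the encoding procedure of the paper. All function names are hypothetical.

```python
def instance_adjacency(label_map):
    """Map each instance id (> 0) to the set of instance ids it touches."""
    height, width = len(label_map), len(label_map[0])
    adjacency = {}
    for y in range(height):
        for x in range(width):
            current = label_map[y][x]
            if current == 0:  # 0 is background
                continue
            adjacency.setdefault(current, set())
            # Checking right and down neighbors covers every 4-connected pair.
            for dy, dx in ((0, 1), (1, 0)):
                ny, nx = y + dy, x + dx
                if ny < height and nx < width:
                    other = label_map[ny][nx]
                    if other != 0 and other != current:
                        adjacency[current].add(other)
                        adjacency.setdefault(other, set()).add(current)
    return adjacency


def greedy_color(adjacency):
    """Assign each instance the smallest color not used by its neighbors."""
    colors = {}
    for node in sorted(adjacency, key=lambda n: -len(adjacency[n])):
        used = {colors[n] for n in adjacency[node] if n in colors}
        color = 1
        while color in used:
            color += 1
        colors[node] = color
    return colors


def encode(label_map):
    """Replace instance ids with color classes; background stays 0."""
    colors = greedy_color(instance_adjacency(label_map))
    return [[colors.get(value, 0) for value in row] for row in label_map]


cells = [[1, 1, 2], [3, 0, 2], [3, 4, 4]]
for row in encode(cells):
    print(row)
```

A network trained on such an encoding predicts only a handful of classes, and instances can be recovered afterwards as connected components within each color class.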

Abstract:
Historical satellite imagery archives, such as Keyhole satellite data, offer rare insights into early urban development and long‑term transformation. However, severe quality degradation (e.g., distortion, misalignment, and spectral scarcity) and the absence of annotations have long hindered their analysis. To bridge this gap and enhance understanding of urban development, we introduce WakeupUrbanBench, an annotated segmentation dataset based on historical satellite imagery with the earliest observation time among all existing remote sensing (RS) datasets, along with a framework for unsupervised segmentation tasks, WakeupUSM. First, WakeupUrbanBench serves as a pioneering, expertly annotated dataset built on mid‑20th century RS imagery, involving four key urban classes and spanning 4 cities across 2 continents with nearly 1000 km^2 of diverse urban morphologies, and additionally introducing one present‑day city. Second, WakeupUSM is a novel unsupervised semantic segmentation framework for historical RS imagery. It employs a confidence‑aware alignment mechanism and focal‑confidence loss based on a self‑supervised learning architecture, which generates robust pseudo‑labels and adaptively prioritizes prediction difficulty and label reliability to improve unsupervised segmentation on noisy historical data without manual supervision. Comprehensive experiments demonstrate that WakeupUSM significantly outperforms existing unsupervised segmentation methods on both WakeupUrbanBench and a public dataset, promising to pave the way for quantitative studies of long‑term urban change using modern computer vision. Our benchmark and code will be released at https://github.com/Tianxiang‑Hao/WakeupUrban.

Abstract:
Remote sensing image interpretation plays a critical role in environmental monitoring, urban planning, and disaster assessment. However, acquiring high‑quality labeled data is often costly and time‑consuming. To address this challenge, we propose a multi‑modal self‑supervised learning framework that leverages high‑resolution RGB images, multi‑spectral data, and digital surface models (DSM) for pre‑training. By designing an information‑aware adaptive masking strategy, a cross‑modal masking mechanism, and multi‑task self‑supervised objectives, the framework effectively captures both the correlations across different modalities and the unique feature structures within each modality. We evaluate the proposed method on multiple downstream tasks, covering typical remote sensing applications such as scene classification, semantic segmentation, change detection, object detection, and depth estimation. Experiments are conducted on 15 remote sensing datasets, encompassing 26 tasks. The results demonstrate that the proposed method outperforms existing pre‑training approaches in most tasks. Specifically, on the Potsdam and Vaihingen semantic segmentation tasks, our method achieves mIoU scores of 78.30% and 76.50%, using only 50% of the training set. For the US3D depth estimation task, the RMSE is reduced to 0.182, and for the binary change detection task on the SECOND dataset, our method achieves an mIoU of 47.51%, surpassing the second‑best method, CS‑MAE, by 3 percentage points. Our pre‑training code, checkpoints, and HR‑Pairs dataset can be found at https://github.com/CVEO/MSSDF.

Abstract:
We study the problem of unsupervised 3D semantic segmentation on raw point clouds without needing human labels in training. Existing methods usually formulate this problem as learning per‑point local features followed by a simple grouping strategy, lacking the ability to discover additional and possibly richer semantic priors beyond local features. In this paper, we introduce LogoSP to learn 3D semantics from both local and global point features. The key to our approach is to discover 3D semantic information by grouping superpoints according to their global patterns in the frequency domain, thus generating highly accurate semantic pseudo‑labels for training a segmentation network. Extensive experiments on two indoor datasets and one outdoor dataset show that LogoSP surpasses all existing unsupervised methods by large margins, achieving state‑of‑the‑art performance for unsupervised 3D semantic segmentation. Notably, our investigation into the learned global patterns reveals that they truly represent meaningful 3D semantics in the absence of human labels during training.

Abstract:
We introduce a trend‑aware and visually‑grounded fashion recommendation system that integrates deep visual representations, garment‑aware segmentation, semantic category similarity and user behavior simulation. Our pipeline extracts focused visual embeddings by masking non‑garment regions via semantic segmentation followed by feature extraction using pretrained CNN backbones (ResNet‑50, DenseNet‑121, VGG16). To simulate realistic shopping behavior, we generate synthetic purchase histories influenced by user‑specific trendiness and item popularity. Recommendations are computed using a weighted scoring function that fuses visual similarity, semantic coherence and popularity alignment. Experiments on the DeepFashion dataset demonstrate consistent gender alignment and improved category relevance, with ResNet‑50 achieving 64.95% category similarity and lowest popularity MAE. An ablation study confirms the complementary roles of visual and popularity cues. Our method provides a scalable framework for personalized fashion recommendations that balances individual style with emerging trends. Our implementation is available at https://github.com/meddjilani/FashionRecommender
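
The weighted scoring function that fuses visual similarity, semantic coherence, and popularity alignment can be pictured as a convex combination of three per-item scores. The sketch below is an assumption-laden illustration: the weights, the score ranges (taken to lie in [0, 1]), and the item names are hypothetical and are not taken from the paper or its repository.

```python
def fused_score(visual, semantic, popularity, weights=(0.6, 0.2, 0.2)):
    """Combine three scores in [0, 1] using weights that sum to 1."""
    w_visual, w_semantic, w_popularity = weights
    return w_visual * visual + w_semantic * semantic + w_popularity * popularity


def recommend(candidates, top_n=2, weights=(0.6, 0.2, 0.2)):
    """Rank candidates by fused score.

    candidates: dict mapping item id -> (visual, semantic, popularity).
    """
    ranked = sorted(
        candidates,
        key=lambda item: fused_score(*candidates[item], weights=weights),
        reverse=True,
    )
    return ranked[:top_n]


# Made-up scores for three catalog items.
items = {
    "denim_jacket": (0.9, 0.8, 0.3),
    "floral_dress": (0.4, 0.5, 0.9),
    "plain_tee": (0.7, 0.9, 0.6),
}
print(recommend(items))
```

Setting one weight to zero reproduces the kind of ablation the abstract mentions, where a cue is removed to measure its contribution.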

Abstract:
Contrastive learning for single object centric images has achieved remarkable progress in unsupervised representation learning, but suffers from inferior performance on the more widespread images with multiple objects. In this paper, we propose a simple but effective method, Multiple Object Stitching (MOS), to refine the unsupervised representation for multi‑object images. Specifically, we construct multi‑object images by stitching single object centric ones, so that the objects in the synthesized multi‑object images are predetermined. Hence, compared to existing contrastive methods, our method provides additional object correspondences between multi‑object images without human annotations. In this manner, our method pays more attention to the representation of each object in multi‑object images, thus providing more detailed representations for complicated downstream tasks, such as object detection and semantic segmentation. Experimental results on the ImageNet, CIFAR and COCO datasets demonstrate that our proposed method achieves leading unsupervised representation performance on both single object centric images and multi‑object ones. The source code is available at https://github.com/visresearch/MultipleObjectStitching.

Abstract:
While promptable segmentation (e.g., SAM) has shown promise for various segmentation tasks, it still requires manual visual prompts for each object to be segmented. In contrast, task‑generic promptable segmentation aims to reduce the need for such detailed prompts by employing only a task‑generic prompt to guide segmentation across all test samples. However, when applied to Camouflaged Object Segmentation (COS), current methods still face two critical issues: 1) semantic ambiguity in getting instance‑specific text prompts, which arises from insufficient discriminative cues in holistic captions, leading to foreground‑background confusion; 2) semantic discrepancy combined with spatial separation in getting instance‑specific visual prompts, which results from global background sampling far from object boundaries with low feature correlation, causing SAM to segment irrelevant regions. To address the issues above, we propose RDVP‑MSD, a novel training‑free test‑time adaptation framework that synergizes Region‑constrained Dual‑stream Visual Prompting (RDVP) via Multimodal Stepwise Decomposition Chain of Thought (MSD‑CoT). MSD‑CoT progressively disentangles image captions to eliminate semantic ambiguity, while RDVP injects spatial constraints into visual prompting and independently samples visual prompts for foreground and background points, effectively mitigating semantic discrepancy and spatial separation. Without requiring any training or supervision, RDVP‑MSD achieves a state‑of‑the‑art segmentation result on multiple COS benchmarks and delivers a faster inference speed than previous methods, demonstrating significantly improved accuracy and efficiency. The codes will be available at https://github.com/ycyinchao/RDVP‑MSD

Abstract:
In this report, we present a cross‑view multi‑modal object segmentation approach for the object correspondence task in the Ego‑Exo4D Correspondence Challenges 2025. Given object queries from one perspective (e.g., ego view), the goal is to predict the corresponding object masks in another perspective (e.g., exo view). To tackle this task, we propose a multimodal condition fusion module that enhances object localization by leveraging both visual masks and textual descriptions as segmentation conditions. Furthermore, to address the visual domain gap between ego and exo views, we introduce a cross‑view object alignment module that enforces object‑level consistency across perspectives, thereby improving the model's robustness to viewpoint changes. Our proposed method ranked second on the leaderboard of the large‑scale Ego‑Exo4D object correspondence benchmark. Code will be made available at https://github.com/lovelyqian/ObjectRelator.

Abstract:
Segmenting objects with complex shapes, such as wires, bicycles, or structural grids, remains a significant challenge for current segmentation models, including the Segment Anything Model (SAM) and its high‑quality variant SAM‑HQ. These models often struggle with thin structures and fine boundaries, leading to poor segmentation quality. We propose Talk2SAM, a novel approach that integrates textual guidance to improve segmentation of such challenging objects. The method uses CLIP‑based embeddings derived from user‑provided text prompts to identify relevant semantic regions, which are then projected into the DINO feature space. These features serve as additional prompts for SAM‑HQ, enhancing its ability to focus on the target object. Beyond improving segmentation accuracy, Talk2SAM allows user‑controllable segmentation, enabling disambiguation of objects within a single bounding box based on textual input. We evaluate our approach on three benchmarks: BIG, ThinObject5K, and DIS5K. Talk2SAM consistently outperforms SAM‑HQ, achieving up to +5.9% IoU and +8.3% boundary IoU improvements. Our results demonstrate that incorporating natural language guidance provides a flexible and effective means for precise object segmentation, particularly in cases where traditional prompt‑based methods fail. The source code is available on GitHub: https://github.com/richlukich/Talk2SAM

Abstract:
Spatio‑temporal localization is vital for precise interactions across diverse domains, from biological research to autonomous navigation and interactive interfaces. Current video‑based approaches, while proficient in tracking, lack the sophisticated reasoning capabilities of large language models, limiting their contextual understanding and generalization. We introduce VideoMolmo, a large multimodal model tailored for fine‑grained spatio‑temporal pointing conditioned on textual descriptions. Building upon the Molmo architecture, VideoMolmo incorporates a temporal module utilizing an attention mechanism to condition each frame on preceding frames, ensuring temporal consistency. Additionally, our novel temporal mask fusion pipeline employs SAM2 for bidirectional point propagation, significantly enhancing coherence across video sequences. This two‑step decomposition, i.e., first using the LLM to generate precise pointing coordinates, then relying on a sequential mask‑fusion module to produce coherent segmentation, not only simplifies the task for the language model but also enhances interpretability. Due to the lack of suitable datasets, we curate a comprehensive dataset comprising 72k video‑caption pairs annotated with 100k object points. To evaluate the generalization of VideoMolmo, we introduce VPoS‑Bench, a challenging out‑of‑distribution benchmark spanning five real‑world scenarios: Cell Tracking, Egocentric Vision, Autonomous Driving, Video‑GUI Interaction, and Robotics. We also evaluate our model on Referring Video Object Segmentation (Refer‑VOS) and Reasoning VOS tasks. In comparison to existing models, VideoMolmo substantially improves spatio‑temporal pointing accuracy and reasoning capability. Our code and models are publicly available at https://github.com/mbzuai‑oryx/VideoMolmo.

Abstract:
Although perception systems have made remarkable advancements in recent years, particularly in 2D reasoning segmentation, these systems still rely on explicit human instruction or pre‑defined categories to identify target objects before executing visual recognition tasks. Such systems have matured significantly, demonstrating the ability to reason and comprehend implicit user intentions in two‑dimensional contexts, producing accurate segmentation masks based on complex and implicit query text. However, a comparable framework and structure for 3D reasoning segmentation remain absent. This paper introduces OpenMaskDINO3D, an LLM designed for comprehensive 3D understanding and segmentation. OpenMaskDINO3D processes point cloud data and text prompts to produce instance segmentation masks, excelling in many 3D tasks. By introducing a SEG token and object identifier, we achieve high‑precision 3D segmentation mask generation, enabling the model to directly produce accurate point cloud segmentation results from natural language instructions. Experimental results on large‑scale ScanNet datasets validate the effectiveness of our OpenMaskDINO3D across various tasks.

Abstract:
Data scarcity, label noise, and long‑tailed category imbalance remain important and unresolved challenges in many computer vision tasks, such as object detection and instance segmentation, especially on large‑vocabulary benchmarks like LVIS, where most categories appear in only a few images. Current synthetic data generation methods still suffer from multiple objects per mask, inaccurate segmentation, incorrect category labels, and other issues, limiting their effectiveness. To address these issues, we introduce Gen‑n‑Val, a novel agentic data generation framework that leverages Layer Diffusion (LD), a Large Language Model (LLM), and a Vision Large Language Model (VLLM) to produce high‑quality and diverse instance masks and images for object detection and instance segmentation. Gen‑n‑Val consists of two agents: (1) the LD prompt agent, an LLM, optimizes prompts to encourage LD to generate high‑quality foreground single‑object images and corresponding segmentation masks; and (2) the data validation agent, a VLLM, filters out low‑quality synthetic instance images. The system prompts for both agents are optimized by TextGrad. Compared to state‑of‑the‑art synthetic data approaches like MosaicFusion, our approach reduces invalid synthetic data from 50% to 7% and improves performance by 7.6% on rare classes in LVIS instance segmentation with Mask R‑CNN, and by 3.6% mAP on rare classes in COCO instance segmentation with YOLOv9c and YOLO11m. Furthermore, Gen‑n‑Val shows significant improvements (7.1% mAP) over YOLO‑Worldv2‑M in open‑vocabulary object detection benchmarks with YOLO11m. Moreover, Gen‑n‑Val scales with model capacity and dataset size. The code is available at https://github.com/aiiu‑lab/Gen‑n‑Val.

Abstract:
Open‑vocabulary semantic segmentation (OVSS) entails assigning semantic labels to each pixel in an image using textual descriptions, typically leveraging world models such as CLIP. To enhance out‑of‑domain generalization, we propose Cost Aggregation with Optimal Transport (OV‑COAST) for open‑vocabulary semantic segmentation. To align visual‑language features within the framework of optimal transport theory, we employ cost volume to construct a cost matrix, which quantifies the distance between two distributions. Our approach adopts a two‑stage optimization strategy: in the first stage, the optimal transport problem is solved using cost volume via Sinkhorn distance to obtain an alignment solution; in the second stage, this solution is used to guide the training of the CAT‑Seg model. We evaluate state‑of‑the‑art OVSS models on the MESS benchmark, where our approach notably improves the performance of the cost‑aggregation model CAT‑Seg with ViT‑B backbone, achieving superior results, surpassing CAT‑Seg by 1.72% and SAN‑B by 4.9% mIoU. The code is available at https://github.com/adityagandhamal/OV‑COAST/.

Abstract:
Referring Remote Sensing Image Segmentation (RRSIS) is a complex and challenging task that integrates the paradigms of computer vision and natural language processing. Existing datasets for RRSIS suffer from critical limitations in resolution, scene diversity, and category coverage, which hinder the generalization and real‑world applicability of referring segmentation models. To facilitate the development of this field, we introduce NWPU‑Refer, the largest and most diverse RRSIS dataset to date, comprising 15,003 high‑resolution images (1024‑2048px) spanning 30+ countries with 49,745 annotated targets supporting single‑object, multi‑object, and non‑object segmentation scenarios. Additionally, we propose the Multi‑scale Referring Segmentation Network (MRSNet), a novel framework tailored for the unique demands of RRSIS. MRSNet introduces two key innovations: (1) an Intra‑scale Feature Interaction Module (IFIM) that captures fine‑grained details within each encoder stage, and (2) a Hierarchical Feature Interaction Module (HFIM) to enable seamless cross‑scale feature fusion, preserving spatial integrity while enhancing discriminative power. Extensive experiments conducted on the proposed NWPU‑Refer dataset demonstrate that MRSNet achieves state‑of‑the‑art performance across multiple evaluation metrics, validating its effectiveness. The dataset and code are publicly available at https://github.com/CVer‑Yang/NWPU‑Refer.

Abstract:
Autonomous driving simulators still lack high‑fidelity radar, even though radar is critical for robust perception in adverse weather. A key obstacle is that raw radar point clouds are extremely sparse and stochastic, making them difficult to model; we argue that simulating the full range‑azimuth‑Doppler cube is a more principled target. Existing radar cube simulators either rely purely on neural generators, which are opaque and offer little control over sensor attributes, or on detailed electromagnetic pipelines, which are slow, require proprietary hardware specifications, and still struggle to capture real‑world complexity. We introduce Ctrl‑RS, a controllable radar cube simulation framework that combines the strengths of both worlds. First, we build an environment reflection tensor from diverse sensor sources (including LiDAR, monocular cameras, and existing radar). Second, we abstract radar physics into a compact set of waveform parameters that characterize the 3D point spread function, yielding an intuitive embedding of radar attributes such as range resolution, Doppler broadening, and azimuth beam shape. Third, we train a WARP‑Net on a large mixed dataset that fuses real, analytically synthesized, and simulator‑generated radar cubes to cover a wide distribution of radar attributes. Ctrl‑RS supports viewpoint changes, actor removal, and attribute editing. Experiments on RADDet, Carrada, and nuScenes show that our simulated data can match or surpass real radar in 2D detection and semantic segmentation, and consistently boosts performance in 3D detection when combined with real data. The project is available at https://github.com/zhuxing0/Ctrl‑RS.

Abstract:
Existing semantic SLAM systems for dynamic environments mainly identify dynamic regions through object detection or semantic segmentation methods. However, in certain highly dynamic scenarios, the detection boxes or segmentation masks cannot fully cover dynamic regions. Therefore, this paper proposes a robust and efficient GeneA‑SLAM2 system that leverages depth variance constraints to handle dynamic scenes. Our method extracts dynamic pixels via depth variance and creates precise depth masks to guide the removal of dynamic objects. Simultaneously, an autoencoder is used to reconstruct keypoints, improving the genetic resampling keypoint algorithm to obtain more uniformly distributed keypoints and enhance the accuracy of pose estimation. Our system was evaluated on multiple highly dynamic sequences. The results demonstrate that GeneA‑SLAM2 maintains high accuracy in dynamic scenes compared to current methods. Code is available at: https://github.com/qingshufan/GeneA‑SLAM2.

Abstract:
Referring video object segmentation (RVOS) aims to segment objects in a video described by a natural language expression. However, most existing approaches focus on segmenting only the referred object (typically the actor), even when the expression clearly describes an interaction involving multiple objects with distinct roles. For instance, "A throwing B" implies a directional interaction, but standard RVOS segments only the actor (A), neglecting other involved target objects (B). In this paper, we introduce Interaction‑aware Referring Video Object Segmentation (InterRVOS), a novel task that focuses on the modeling of interactions. It requires the model to segment the actor and target objects separately, reflecting their asymmetric roles in an interaction. This task formulation enables fine‑grained understanding of object relationships, as many video events are defined by such relationships rather than individual objects. To support this task, we propose a new evaluation protocol that separately evaluates actor and target segmentation, enabling more accurate assessment of the model's ability to distinguish and segment actor and target roles. We also present InterRVOS‑127K, a large‑scale dataset with over 127K automatically annotated expressions, including interaction expressions annotated with distinct masks for actor and target objects. Furthermore, we develop ReVIOSa, an MLLM‑based architecture that introduces interaction‑aware special tokens and leverages an attention mask loss to enhance role‑specific segmentation. Extensive experiments show that ReVIOSa not only outperforms existing baselines on our proposed InterRVOS‑127K evaluation set, but also achieves strong performance on standard RVOS benchmarks. Our project page is available at: https://cvlab‑kaist.github.io/InterRVOS.

Abstract:
We introduce a new task, Map and Locate, which unifies the traditionally distinct objectives of open‑vocabulary segmentation (detecting and segmenting object instances based on natural language queries) and 3D reconstruction (estimating a scene's 3D structure from visual inputs). Specifically, Map and Locate involves generating a point cloud from an unposed video and segmenting object instances based on open‑vocabulary queries. This task serves as a critical step toward real‑world embodied AI applications and introduces a practical task that bridges reconstruction, recognition and reorganization. To tackle this task, we introduce a simple yet effective baseline, which we denote as SAB3R. Our approach builds upon MASt3R, a recent breakthrough in 3D computer vision, and incorporates a lightweight distillation strategy. This method transfers dense, per‑pixel semantic features from 2D vision backbones (e.g., CLIP and DINOv2) to enhance MASt3R's capabilities. Without introducing any auxiliary frozen networks, our model generates per‑pixel semantic features and constructs cohesive point maps in a single forward pass. Compared to separately deploying MASt3R and CLIP, our unified model, SAB3R, achieves superior performance on the Map and Locate benchmark. Furthermore, we evaluate SAB3R on both 2D semantic segmentation and 3D tasks to comprehensively validate its effectiveness.

Abstract:
We study the challenging problem of unsupervised multi‑object segmentation on single images. Existing methods, which rely on image reconstruction objectives to learn objectness or leverage pretrained image features to group similar pixels, often succeed only in segmenting simple synthetic objects or discovering a limited number of real‑world objects. In this paper, we introduce unMORE, a novel two‑stage pipeline designed to identify many complex objects in real‑world images. The key to our approach involves explicitly learning three levels of carefully defined object‑centric representations in the first stage. Subsequently, our multi‑object reasoning module utilizes these learned object priors to discover multiple objects in the second stage. Notably, this reasoning module is entirely network‑free and does not require human labels. Extensive experiments demonstrate that unMORE significantly outperforms all existing unsupervised methods across 6 real‑world benchmark datasets, including the challenging COCO dataset, achieving state‑of‑the‑art object segmentation results. Remarkably, our method excels in crowded images where all baselines collapse.

Abstract:
Visual Semantic Navigation (VSN) is a fundamental problem in robotics, where an agent must navigate toward a target object in an unknown environment, mainly using visual information. Most state‑of‑the‑art VSN models are trained in simulation environments, where rendered scenes of the real world are used, at best. These approaches typically rely on raw RGB data from the virtual scenes, which limits their ability to generalize to real‑world environments due to domain adaptation issues. To tackle this problem, in this work, we propose SEMNAV, a novel approach that leverages semantic segmentation as the main visual input representation of the environment to enhance the agent's perception and decision‑making capabilities. By explicitly incorporating this type of high‑level semantic information, our model learns robust navigation policies that improve generalization across unseen environments, both in simulated and real world settings. We also introduce the SEMNAV dataset, a newly curated dataset designed for training semantic segmentation‑aware navigation models like SEMNAV. Our approach is evaluated extensively in both simulated environments and with real‑world robotic platforms. Experimental results demonstrate that SEMNAV outperforms existing state‑of‑the‑art VSN models, achieving higher success rates in the Habitat 2.0 simulation environment, using the HM3D dataset. Furthermore, our real‑world experiments highlight the effectiveness of semantic segmentation in mitigating the sim‑to‑real gap, making our model a promising solution for practical VSN‑based robotic applications. The code and datasets are accessible at https://github.com/gramuah/semnav

Abstract:
Foundation models like the Segment Anything Model (SAM) have significantly advanced promptable image segmentation in computer vision. However, extending these capabilities to videos presents substantial challenges, particularly in ensuring precise and temporally consistent mask propagation in dynamic scenes. SAM 2 attempts to address this by training a model on massive image and video data from scratch to learn complex spatiotemporal associations, resulting in huge training costs that hinder research and practical deployment. In this paper, we introduce SAM‑I2V, an effective image‑to‑video upgradation method for cultivating a promptable video segmentation (PVS) model. Our approach strategically upgrades the pre‑trained SAM to support PVS, significantly reducing training complexity and resource requirements. To achieve this, we introduce three key innovations: (i) an image‑to‑video feature extraction upgrader built upon SAM's static image encoder to enable spatiotemporal video perception, (ii) a memory filtering strategy that selects the most relevant past frames for more effective utilization of historical information, and (iii) a memory‑as‑prompt mechanism leveraging object memory to ensure temporally consistent mask propagation in dynamic scenes. Comprehensive experiments demonstrate that our method achieves over 90% of SAM 2's performance while using only 0.2% of its training cost. Our work presents a resource‑efficient pathway to PVS, lowering barriers for further research in PVS model design and enabling broader applications and advancements in the field. Code and model are available at: https://github.com/showlab/SAM‑I2V.

Abstract:
David Marr's seminal theory of human perception stipulates that visual processing is a multi‑stage process, prioritizing the derivation of boundary and surface properties before forming semantic object representations. In contrast, contrastive representation learning frameworks typically bypass this explicit multi‑stage approach, defining their objective as the direct learning of a semantic representation space for objects. While effective in general contexts, this approach sacrifices the inductive biases of vision, leading to slower convergence and shortcut learning that results in texture bias. In this work, we demonstrate that leveraging Marr's multi‑stage theory, by first constructing boundary and surface‑level representations using perceptual constructs from early visual processing stages and subsequently training for object semantics, leads to 2x faster convergence on ResNet18, improved final representations on semantic segmentation, depth estimation, and object recognition, and enhanced robustness and out‑of‑distribution capability. Together, we propose a pretraining stage before the general contrastive representation pretraining to further enhance the final representation quality and reduce the overall convergence time via inductive bias from human vision systems.

Abstract:
Given a text query, partially relevant video retrieval (PRVR) aims to retrieve untrimmed videos containing relevant moments, wherein event modeling is crucial for partitioning the video into smaller temporal events that partially correspond to the text. Previous methods typically segment videos into a fixed number of equal‑length clips, resulting in ambiguous event boundaries. Additionally, they rely on mean pooling to compute event representations, inevitably introducing undesired misalignment. To address these issues, we propose an Uneven Event Modeling (UEM) framework for PRVR. We first introduce the Progressive‑Grouped Video Segmentation (PGVS) module to iteratively formulate events in light of both temporal dependencies and semantic similarity between consecutive frames, enabling clear event boundaries. Furthermore, we propose the Context‑Aware Event Refinement (CAER) module to refine the event representation conditioned on the text's cross‑attention. This enables event representations to focus on the most relevant frames for a given text, facilitating more precise text‑video alignment. Extensive experiments demonstrate that our method achieves state‑of‑the‑art performance on two PRVR benchmarks. Code is available at https://github.com/Sasa77777779/UEM.git.
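The idea of forming uneven events from consecutive frames can be sketched as follows. This is only one plausible reading of the PGVS description, not the authors' algorithm: the cosine-similarity criterion against a running event mean, the `threshold` value, and the function names are assumptions.

```python
import math


def cosine(x, y):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(p * q for p, q in zip(x, y))
    norm = math.sqrt(sum(p * p for p in x)) * math.sqrt(sum(q * q for q in y))
    return dot / norm if norm else 0.0


def group_frames(frames, threshold=0.5):
    """Split a sequence of frame features into variable-length events.

    A frame joins the current event while it stays similar to the event's
    running mean; otherwise it opens a new event (hypothetical rule).
    """
    events, current, mean = [], [], None
    for idx, feat in enumerate(frames):
        if current and cosine(feat, mean) < threshold:
            events.append(current)
            current, mean = [], None
        current.append(idx)
        if mean is None:
            mean = list(feat)
        else:
            k = len(current)
            # Incremental update of the event's mean feature.
            mean = [m + (f - m) / k for m, f in zip(mean, feat)]
    if current:
        events.append(current)
    return events


# Toy features: two frames of one scene followed by two frames of another.
events = group_frames([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
```

Unlike fixed equal-length clips, the event boundaries here fall wherever the frame content changes, so events can have different lengths.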

Abstract:
LiDAR semantic segmentation plays a vital role in autonomous driving. Existing voxel‑based methods for LiDAR semantic segmentation apply uniform partition to the 3D LiDAR point cloud to form a structured representation based on Cartesian/cylindrical coordinates. Although these methods show impressive performance, they have two drawbacks: (1) they require a sufficiently large input voxel resolution, which incurs high computation cost and memory consumption; and (2) they do not handle the unbalanced point distribution of LiDAR point clouds well. In this paper, we propose a non‑uniform cylindrical partition network named NUC‑Net to tackle the above challenges. Specifically, we propose the Arithmetic Progression of Interval (API) method to non‑uniformly partition the radial axis and generate a voxel representation that is representative and efficient. Moreover, we propose a non‑uniform multi‑scale aggregation method to improve contextual information. Our method achieves state‑of‑the‑art performance on the SemanticKITTI and nuScenes datasets with much faster speed and much less training time. Our method can also serve as a general component for LiDAR semantic segmentation, significantly improving both the accuracy and efficiency of the uniform counterpart with 4× faster training, 2× GPU memory reduction, and 3× inference speedup. We further provide theoretical analysis towards understanding why NUC is effective and how point distribution affects performance. Code is available at https://github.com/alanWXZ/NUC‑Net.
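A non-uniform radial partition whose interval widths follow an arithmetic progression can be sketched as below. The abstract does not give the exact API formula, so the parameterization here (a chosen first interval width, with the common difference solved so the bins exactly cover the range) and the function names are assumptions.

```python
import bisect


def api_radial_edges(r_min, r_max, n_bins, first_width):
    """Radial bin edges whose widths grow as first_width + k * d, k = 0..n_bins-1.

    The common difference d is solved from
    n_bins * first_width + d * n_bins * (n_bins - 1) / 2 = r_max - r_min.
    """
    if n_bins < 2:
        raise ValueError("need at least two bins")
    span = r_max - r_min
    d = 2.0 * (span - n_bins * first_width) / (n_bins * (n_bins - 1))
    if d < 0:
        raise ValueError("first_width is too large for this range")
    edges = [r_min]
    for k in range(n_bins):
        edges.append(edges[-1] + first_width + k * d)
    return edges


def radial_bin(r, edges):
    """Index of the radial bin containing r, clamped to the valid range."""
    idx = bisect.bisect_right(edges, r) - 1
    return min(max(idx, 0), len(edges) - 2)


# Hypothetical setting: 10 radial bins over 0-50 m, finest bin 1 m wide.
edges = api_radial_edges(0.0, 50.0, 10, 1.0)
widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
```

Near the sensor, where points are dense, the bins are narrow; far away, where points are sparse, the bins widen, so the same number of voxels covers the range with a more balanced point count per voxel.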

Abstract:
In this work, we focus on the task of weakly supervised affordance grounding, where a model is trained to identify affordance regions on objects using human‑object interaction images and egocentric object images without dense labels. Previous works are mostly built upon class activation maps, which are effective for semantic segmentation but may not be suitable for locating actions and functions. Leveraging recent advanced foundation models, we develop a supervised training pipeline based on pseudo labels. The pseudo labels are generated from an off‑the‑shelf part segmentation model, guided by a mapping from affordance to part names. Furthermore, we introduce three key enhancements to the baseline model: a label refining stage, a fine‑grained feature alignment process, and a lightweight reasoning module. These techniques harness the semantic knowledge of static objects embedded in off‑the‑shelf foundation models to improve affordance learning, effectively bridging the gap between objects and actions. Extensive experiments demonstrate that the performance of the proposed model has achieved a breakthrough improvement over existing methods. Our codes are available at https://github.com/woyut/WSAG‑PLSP .

Abstract:
Image‑text models excel at image‑level tasks but struggle with detailed visual understanding. While these models provide strong visual‑language alignment, segmentation models like SAM2 offer precise spatial boundaries for objects. To this end, we propose TextRegion, a simple, effective, and training‑free framework that combines the strengths of image‑text models and SAM2 to generate powerful text‑aligned region tokens. These tokens enable detailed visual understanding while preserving open‑vocabulary capabilities. They can be directly applied to various downstream tasks, including open‑world semantic segmentation, referring expression comprehension, and grounding. We conduct extensive evaluations and consistently achieve superior or competitive performance compared to state‑of‑the‑art training‑free methods. Additionally, our framework is compatible with many image‑text models, making it highly practical and easily extensible as stronger models emerge. Code is available at: https://github.com/avaxiao/TextRegion.

Abstract:
This work explores the application of Federated Learning (FL) to Unsupervised Semantic image Segmentation (USS). Recent USS methods extract pixel‑level features using frozen visual foundation models and refine them through self‑supervised objectives that encourage semantic grouping. These features are then grouped to semantic clusters to produce segmentation masks. Extending these ideas to federated settings requires feature representation and cluster centroid alignment across distributed clients, an inherently difficult task under heterogeneous data distributions in the absence of supervision. To address this, we propose FUSS (Federated Unsupervised image Semantic Segmentation) which is, to our knowledge, the first framework to enable fully decentralized, label‑free semantic segmentation training. FUSS introduces novel federation strategies that promote global consistency in feature and prototype space, jointly optimizing local segmentation heads and shared semantic centroids. Experiments on both benchmark and real‑world datasets, including binary and multi‑class segmentation tasks, show that FUSS consistently outperforms local‑only client trainings as well as extensions of classical FL algorithms under varying client data distributions. To fully support reproducibility, the source code, data partitioning scripts, and implementation details are publicly available at: https://github.com/evanchar/FUSS

Abstract:
Unsupervised domain adaptation for semantic segmentation (UDA‑SS) aims to transfer knowledge from labeled source data to unlabeled target data. However, traditional UDA‑SS methods assume that category settings between source and target domains are known, which is unrealistic in real‑world scenarios. This leads to performance degradation if private classes exist. To address this limitation, we propose Universal Domain Adaptation for Semantic Segmentation (UniDA‑SS), achieving robust adaptation even without prior knowledge of category settings. We define the problem in the UniDA‑SS scenario as low confidence scores of common classes in the target domain, which leads to confusion with private classes. To solve this problem, we propose UniMAP: UniDA‑SS with Image Matching and Prototype‑based Distinction, a novel framework composed of two key components. First, Domain‑Specific Prototype‑based Distinction (DSPD) divides each class into two domain‑specific prototypes, enabling finer separation of domain‑specific features and enhancing the identification of common classes across domains. Second, Target‑based Image Matching (TIM) selects a source image containing the most common‑class pixels based on the target pseudo‑label and pairs it in a batch to promote effective learning of common classes. We also introduce a new UniDA‑SS benchmark and demonstrate through various experiments that UniMAP significantly outperforms baselines. The code is available at https://github.com/KU‑VGI/UniMAP.
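The selection rule behind Target‑based Image Matching can be illustrated with a small sketch. It assumes that the classes present in the target pseudo‑label stand in for the common classes and that an `ignore_label` marks unknown pixels; both choices, along with the function names, are illustrative assumptions and not the UniMAP code.

```python
from collections import Counter


def common_pixel_count(source_label, target_classes):
    """Number of pixels in a source label map whose class is in target_classes."""
    counts = Counter(c for row in source_label for c in row)
    return sum(n for c, n in counts.items() if c in target_classes)


def pick_source_image(source_labels, target_pseudo_label, ignore_label=255):
    """Index of the source image with the most pixels of the target's classes."""
    target_classes = {
        c for row in target_pseudo_label for c in row if c != ignore_label
    }
    scores = [common_pixel_count(lbl, target_classes) for lbl in source_labels]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy 2x2 label maps (hypothetical class ids).
source_labels = [
    [[0, 0], [3, 3]],  # shares no class with the target
    [[1, 1], [1, 2]],  # all four pixels belong to target classes
]
target_pseudo_label = [[1, 2], [2, 255]]
best = pick_source_image(source_labels, target_pseudo_label)
```

The chosen source image would then be paired with the target image in the same batch.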

Abstract:
Power transmission corridor hazard segmentation (PTCHS) aims to separate transmission equipment and surrounding hazards from complex background, conveying great significance to maintaining electric power transmission safety. Recently, the Segment Anything Model (SAM) has emerged as a foundational vision model and pushed the boundaries of segmentation tasks. However, SAM struggles to deal with the target objects in complex transmission corridor scenarios, especially those with fine structure. In this paper, we propose ELE‑SAM, adapting SAM for the PTCHS task. Technically, we develop a Context‑Aware Prompt Adapter to achieve better prompt tokens via incorporating global‑local features and focusing more on key regions. Subsequently, to tackle the hazard objects with fine structure in complex background, we design a High‑Fidelity Mask Decoder by leveraging multi‑granularity mask features and then scaling them to a higher resolution. Moreover, to train ELE‑SAM and advance this field, we construct the ELE‑40K benchmark, the first large‑scale and real‑world dataset for PTCHS including 44,094 image‑mask pairs. Experimental results on ELE‑40K show that ELE‑SAM outperforms the baseline model by an average of 16.8% mIoU and 20.6% mBIoU. Moreover, compared with the state‑of‑the‑art method on HQSeg‑44K, the average 2.9% mIoU and 3.8% mBIoU absolute improvements further validate the effectiveness of our method on high‑quality generic object segmentation. The source code and dataset are available at https://github.com/Hhaizee/ELE‑SAM.

Abstract:
Autonomous driving datasets are essential for validating the progress of intelligent vehicle algorithms, which include localization, perception, and prediction. However, existing datasets are predominantly focused on structured urban environments, which limits the exploration of unstructured and specialized scenarios, particularly those characterized by significant dust levels. This paper introduces the LiDARDustX dataset, which is specifically designed for perception tasks under high‑dust conditions, such as those encountered in mining areas. The LiDARDustX dataset consists of 30,000 LiDAR frames captured by six different LiDAR sensors, each accompanied by 3D bounding box annotations and point cloud semantic segmentation. Notably, over 80% of the dataset comprises dust‑affected scenes. By utilizing this dataset, we have established a benchmark for evaluating the performance of state‑of‑the‑art 3D detection and segmentation algorithms. Additionally, we have analyzed the impact of dust on perception accuracy and delved into the causes of these effects. The data and further information can be accessed at: https://github.com/vincentweikey/LiDARDustX.

Abstract:
Recently, test‑time adaptation has attracted wide interest in the context of vision‑language models for image classification. However, to the best of our knowledge, the problem is completely overlooked in dense prediction tasks such as Open‑Vocabulary Semantic Segmentation (OVSS). In response, we propose a novel TTA method tailored to adapting VLMs for segmentation during test time. Unlike TTA methods for image classification, our Multi‑Level and Multi‑Prompt (MLMP) entropy minimization integrates features from intermediate vision‑encoder layers and is performed with different text‑prompt templates at both the global CLS token and local pixel‑wise levels. Our approach could be used as plug‑and‑play for any segmentation network, does not require additional training data or labels, and remains effective even with a single test sample. Furthermore, we introduce a comprehensive OVSS TTA benchmark suite, which integrates a rigorous evaluation protocol, nine segmentation datasets, 15 common synthetic corruptions, and additional real and rendered domain shifts, with a total of 87 distinct test scenarios, establishing a standardized and comprehensive testbed for future TTA research in open‑vocabulary segmentation. Our experiments on this suite demonstrate that our segmentation‑tailored method consistently delivers significant gains over direct adoption of TTA classification baselines. Code and data are available at https://github.com/dosowiechi/MLMP.
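The quantity that entropy-minimization TTA drives down can be sketched for the multi-prompt, pixel-wise case. This only computes the loss value on given logits; the multi-level (intermediate-layer and CLS) terms and the gradient update are omitted, and the data layout and function names are assumptions rather than the MLMP implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability classes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def multi_prompt_entropy(logits):
    """Mean prediction entropy over all prompt templates and pixels.

    logits[t][p] holds the class logits of pixel p under prompt template t.
    """
    values = [entropy(softmax(pixel)) for template in logits for pixel in template]
    return sum(values) / len(values)


# One uncertain pixel vs. confident pixels under two templates (toy logits).
uncertain = multi_prompt_entropy([[[0.0, 0.0]]])
confident = multi_prompt_entropy([[[10.0, 0.0]], [[8.0, 0.0]]])
```

A test-time adaptation step would adjust model parameters so that this value decreases on the incoming test sample.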

Abstract:
Long‑horizon video‑audio reasoning and fine‑grained pixel understanding impose conflicting requirements on omnimodal models: dense temporal coverage demands many low‑resolution frames, whereas precise grounding calls for high‑resolution inputs. We tackle this trade‑off with a two‑system architecture: a Global Reasoning System selects informative keyframes and rewrites the task at low spatial cost, while a Detail Understanding System performs pixel‑level grounding on the selected high‑resolution snippets. Because "optimal" keyframe selection and reformulation are ambiguous and hard to supervise, we formulate them as a reinforcement learning (RL) problem and present Omni‑R1, an end‑to‑end RL framework built on Group Relative Policy Optimization. Omni‑R1 trains the Global Reasoning System through hierarchical rewards obtained via online collaboration with the Detail Understanding System, requiring only one epoch of RL on small task splits. Experiments on two challenging benchmarks, namely Referring Audio‑Visual Segmentation (RefAVS) and Reasoning Video Object Segmentation (REVOS), show that Omni‑R1 not only surpasses strong supervised baselines but also outperforms specialized state‑of‑the‑art models, while substantially improving out‑of‑domain generalization and mitigating multimodal hallucination. Our results demonstrate the first successful application of RL to large‑scale omnimodal reasoning and highlight a scalable path toward universal foundation models.
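The group-relative part of Group Relative Policy Optimization can be illustrated in a few lines: each sampled response is scored against the other responses drawn for the same prompt. The sketch below shows only this advantage normalization, with a population standard deviation and an `eps` term as assumptions; the reward values are hypothetical and the Omni‑R1 hierarchical rewards are not modeled.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four hypothetical keyframe selections for the same query, scored by a reward.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Responses that beat the group average receive a positive advantage and are reinforced, while below-average ones are discouraged, so no separate value network is needed to provide a baseline.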

Abstract:
Semantic segmentation is a fundamental task in medical image analysis and autonomous driving, but annotating the labels required for training is costly. To address this problem, semantic segmentation methods based on semi‑supervised learning with a small amount of labeled data have been proposed. For example, one approach is to train a semantic segmentation model using images with annotated labels and pseudo labels. In this approach, the accuracy of the semantic segmentation model depends on the quality of the pseudo labels, and the quality of the pseudo labels depends on the performance of the model to be trained and the amount of data with annotated labels. In this paper, we generate pseudo labels using zero‑shot annotation with the Segment Anything Model (SAM) and Contrastive Language‑Image Pretraining (CLIP), improve the accuracy of the pseudo labels using the Unified Dual‑Stream Perturbations Approach (UniMatch), and use them as enhanced labels to train a semantic segmentation model. The effectiveness of the proposed method is demonstrated through experiments on the public PASCAL and MS COCO datasets. The project web page is available at: https://gsisaoki.github.io/ZERO‑SHOT‑PLG/

Abstract:
Image segmentation remains a challenging task in computer vision, demanding robust mask generation and precise classification. Recent mask‑based approaches yield high‑quality masks by capturing global context. However, accurately classifying these masks, especially in the presence of ambiguous boundaries and imbalanced class distributions, remains an open challenge. In this work, we introduce ViT‑P, a novel two‑stage segmentation framework that decouples mask generation from classification. The first stage employs a proposal generator to produce class‑agnostic mask proposals, while the second stage utilizes a point‑based classification model built on the Vision Transformer (ViT) to refine predictions by focusing on mask central points. ViT‑P serves as a pre‑training‑free adapter, allowing the integration of various pre‑trained vision transformers without modifying their architecture, ensuring adaptability to dense prediction tasks. Furthermore, we demonstrate that coarse and bounding box annotations can effectively enhance classification without requiring additional training on fine annotation datasets, reducing annotation costs while maintaining strong performance. Extensive experiments across COCO, ADE20K, and Cityscapes datasets validate the effectiveness of ViT‑P, achieving state‑of‑the‑art results with 54.0 PQ on ADE20K panoptic segmentation, 87.4 mIoU on Cityscapes semantic segmentation, and 63.6 mIoU on ADE20K semantic segmentation. The code and pretrained models are available at: https://github.com/sajjad‑sh33/ViT‑P.

Abstract:
Reasoning Video Object Segmentation is a challenging task, aiming at generating a mask sequence from an input video given a complex and implicit text query. While existing works finetune Multimodal Large Language Models (MLLM) for the task, they still fail in video inputs given complex temporally‑sensitive queries, indicating their lack of temporal and spatial integration in complex scenarios. In this paper, we propose CoT‑RVS, a novel framework employing the zero‑shot Chain‑of‑Thought (CoT) capability of MLLM to address these complex challenges by temporal‑semantic reasoning: CoT‑RVS analyzes the visible objects within a given frame that possibly match the language query (semantic), and chooses a corresponding keyframe for each object that can be observed effortlessly among all frames (temporal). Notably, the CoT‑RVS framework is training‑free and compatible with closed‑source MLLMs, which can be applied to Reasoning Video Instance Segmentation. Our framework's training‑free feature further allows its extension to process online video streams, where the CoT is used at test time to update the object of interest when a better target starts to emerge and becomes visible. We conduct extensive experiments on video object segmentation with explicit and implicit queries. The results show that CoT‑RVS significantly outperforms previous works in both cases, qualitatively and quantitatively.

Abstract:
We introduce the Region Encoder Network (REN), a fast and effective model for generating region‑based image representations using point prompts. Recent methods combine class‑agnostic segmenters (e.g., SAM) with patch‑based image encoders (e.g., DINO) to produce compact and effective region representations, but they suffer from high computational cost due to the segmentation step. REN bypasses this bottleneck using a lightweight module that directly generates region tokens, enabling 60x faster token generation with 35x less memory, while also improving token quality. It uses a few cross‑attention blocks that take point prompts as queries and features from a patch‑based image encoder as keys and values to produce region tokens that correspond to the prompted objects. We train REN with three popular encoders (DINO, DINOv2, and OpenCLIP) and show that it can be extended to other encoders without dedicated training. We evaluate REN on semantic segmentation and retrieval tasks, where it consistently outperforms the original encoders in both performance and compactness, and matches or exceeds SAM‑based region methods while being significantly faster. Notably, REN achieves state‑of‑the‑art results on the challenging Ego4D VQ2D benchmark and outperforms proprietary LMMs on Visual Haystacks' single‑needle challenge. Code and models are available at: https://github.com/savya08/REN.

Abstract:
Reliability and generalization in deep learning are predominantly studied in the context of image classification. Yet, real‑world applications in safety‑critical domains involve a broader set of semantic tasks, such as semantic segmentation and object detection, which come with a diverse set of dedicated model architectures. To facilitate research towards robust model design in segmentation and detection, our primary objective is to provide benchmarking tools regarding robustness to distribution shifts and adversarial manipulations. We propose the benchmarking tools SEMSEGBENCH and DETECBENCH, along with the most extensive evaluation to date on the reliability and generalization of semantic segmentation and object detection models. In particular, we benchmark 76 segmentation models across four datasets and 61 object detectors across two datasets, evaluating their performance under diverse adversarial attacks and common corruptions. Our findings reveal systematic weaknesses in state‑of‑the‑art models and uncover key trends based on architecture, backbone, and model capacity. SEMSEGBENCH and DETECBENCH are open‑sourced in our GitHub repository (https://github.com/shashankskagnihotri/benchmarking_reliability_generalization) along with our complete set of 6139 evaluations. We anticipate that the collected data will foster and encourage future research towards improved model reliability beyond classification.

Abstract:
Open‑Vocabulary Segmentation (OVS) has drawn increasing attention for its capacity to generalize segmentation beyond predefined categories. However, existing methods typically predict segmentation masks with simple forward inference, lacking explicit reasoning and interpretability. This makes it challenging for OVS models to distinguish similar categories in open‑world settings due to the lack of contextual understanding and discriminative visual cues. To address this limitation, we propose a step‑by‑step visual reasoning framework for open‑vocabulary segmentation, named OpenSeg‑R. The proposed OpenSeg‑R leverages Large Multimodal Models (LMMs) to perform hierarchical visual reasoning before segmentation. Specifically, we generate both generic and image‑specific reasoning for each image, forming structured triplets that explain the visual reason for objects in a coarse‑to‑fine manner. Based on these reasoning steps, we can compose detailed description prompts and feed them to the segmentor to produce more accurate segmentation masks. To the best of our knowledge, OpenSeg‑R is the first framework to introduce explicit step‑by‑step visual reasoning into OVS. Experimental results demonstrate that OpenSeg‑R significantly outperforms state‑of‑the‑art methods on open‑vocabulary semantic segmentation across five benchmark datasets. Moreover, it achieves consistent gains across all metrics on open‑vocabulary panoptic segmentation. Qualitative results further highlight the effectiveness of our reasoning‑guided framework in improving both segmentation precision and interpretability. Our code is publicly available at https://github.com/Hanzy1996/OpenSeg‑R.

Abstract:
Artificial Intelligence (AI) is accelerating the transformation of scientific research paradigms, not only enhancing research efficiency but also driving innovation. We introduce InternAgent, a unified closed‑loop multi‑agent framework to conduct Autonomous Scientific Research (ASR) across various scientific research fields, enabling researchers to tackle complicated problems in these fields with unprecedented speed and precision. InternAgent highlights three key advantages: 1) Scalability: InternAgent has demonstrated its versatility across 12 scientific research tasks, capable of generating innovative ideas to enhance the performance of baseline code. 2) Interactivity: InternAgent provides an interface for human expert feedback and multi‑agent interaction in automated end‑to‑end processes, allowing for the seamless integration of domain expert knowledge. 3) Efficiency: InternAgent has achieved promising performance gains in several scientific fields with significantly less time cost compared to human efforts. For instance, in reaction yield prediction, it increased from 27.6% to 35.4% in just 12 hours; in enhancer activity prediction, accuracy rose from 0.65 to 0.79 with only 4 hours of processing; and in 2D semantic segmentation, precision advanced from 78.8% to 81.0% in a mere 30 hours.

Abstract:
Bounding box supervision has gained considerable attention in weakly supervised 3D instance segmentation. While this approach alleviates the need for extensive point‑level annotations, obtaining accurate bounding boxes in practical applications remains challenging. To this end, we explore inaccurate bounding boxes, termed sketchy bounding boxes, which are simulated by perturbing ground‑truth bounding boxes with scaling, translation, and rotation. In this paper, we propose Sketchy‑3DIS, a novel weakly supervised 3D instance segmentation framework, which jointly learns a pseudo labeler and a segmentator to improve performance under sketchy bounding‑box supervision. Specifically, we first propose an adaptive box‑to‑point pseudo labeler that adaptively learns to assign points located in the overlapped parts between two sketchy bounding boxes to the correct instance, resulting in compact and pure pseudo instance labels. Then, we present a coarse‑to‑fine instance segmentator that first predicts coarse instances from the entire point cloud and then learns fine instances based on the region of coarse instances. Finally, by using the pseudo instance labels to supervise the instance segmentator, we can gradually generate high‑quality instances through joint training. Extensive experiments show that our method achieves state‑of‑the‑art performance on both the ScanNetV2 and S3DIS benchmarks and, using only sketchy bounding boxes, even outperforms several fully supervised methods. Code is available at https://github.com/dengq7/Sketchy‑3DIS.
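The perturbation of a ground-truth box by scaling, translation, and rotation can be sketched as below. The box parameterization (center, size, yaw), the perturbation ranges, and the name `sketchy_box` are all hypothetical choices for illustration; the abstract does not specify them.

```python
import random

def sketchy_box(box, max_scale=0.1, max_shift=0.1, max_rot=0.1, rng=None):
    """Return a perturbed copy of a 3D box.

    box: (cx, cy, cz, dx, dy, dz, yaw), with yaw in radians.
    max_scale: each side length is multiplied by a factor in [1 - max_scale, 1 + max_scale].
    max_shift: the center moves by up to max_shift times the side length, per axis.
    max_rot:   the yaw changes by up to max_rot radians.
    All ranges are illustrative assumptions.
    """
    rng = rng or random.Random()
    cx, cy, cz, dx, dy, dz, yaw = box
    sizes = [d * (1.0 + rng.uniform(-max_scale, max_scale)) for d in (dx, dy, dz)]
    center = [c + d * rng.uniform(-max_shift, max_shift)
              for c, d in zip((cx, cy, cz), (dx, dy, dz))]
    new_yaw = yaw + rng.uniform(-max_rot, max_rot)
    return (*center, *sizes, new_yaw)

gt_box = (0.0, 0.0, 1.0, 2.0, 2.0, 2.0, 0.0)
noisy_box = sketchy_box(gt_box, rng=random.Random(0))
print(noisy_box)
```

Passing a seeded `random.Random` keeps the perturbation reproducible, which is convenient when comparing labelers under identical noise.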

Abstract:
Semantic segmentation models trained on synthetic data often perform poorly on real‑world images due to domain gaps, particularly in adverse conditions where labeled data is scarce. Yet, recent foundation models make it possible to generate realistic images without any training. This paper proposes to leverage such diffusion models to improve the performance of vision models trained on synthetic data. We introduce two novel techniques for semantically consistent style transfer using diffusion models: Class‑wise Adaptive Instance Normalization and Cross‑Attention (CACTI) and its extension with selective attention Filtering (CACTIF). CACTI applies statistical normalization selectively based on semantic classes, while CACTIF further filters cross‑attention maps based on feature similarity, preventing artifacts in regions with weak cross‑attention correspondences. Our methods transfer style characteristics while preserving semantic boundaries and structural coherence, unlike approaches that apply global transformations or generate content without constraints. Experiments using GTA5 as source and Cityscapes/ACDC as target domains show that our approach produces higher quality images with lower FID scores and better content preservation. Our work demonstrates that class‑aware diffusion‑based style transfer effectively bridges the synthetic‑to‑real domain gap even with minimal target domain data, advancing robust perception systems for challenging real‑world applications. The source code is available at: https://github.com/echigot/cactif.

Abstract:
With recent breakthroughs in large‑scale modeling, the Segment Anything Model (SAM) has demonstrated significant potential in a variety of visual applications. However, due to the lack of underwater domain expertise, SAM and its variants face performance limitations in end‑to‑end underwater instance segmentation tasks, while their higher computational requirements further hinder their application in underwater scenarios. To address this challenge, we propose a large‑scale underwater instance segmentation dataset, UIIS10K, which includes 10,048 images with pixel‑level annotations for 10 categories. Then, we introduce UWSAM, an efficient model designed for automatic and accurate segmentation of underwater instances. UWSAM efficiently distills knowledge from the SAM ViT‑Huge image encoder into the smaller ViT‑Small image encoder via the Mask GAT‑based Underwater Knowledge Distillation (MG‑UKD) method for effective visual representation learning. Furthermore, we design an End‑to‑end Underwater Prompt Generator (EUPG) for UWSAM, which automatically generates underwater prompts instead of explicitly providing foreground points or boxes as prompts, thus enabling the network to locate underwater instances accurately for efficient segmentation. Comprehensive experimental results show that our model is effective, achieving significant performance improvements over state‑of‑the‑art methods on multiple underwater instance datasets. Datasets and codes are available at https://github.com/LiamLian0727/UIIS10K.

Abstract:
By pretraining to synthesize coherent images from perturbed inputs, generative models inherently learn to understand object boundaries and scene compositions. How can we repurpose these generative representations for general‑purpose perceptual organization? We finetune Stable Diffusion and MAE (encoder+decoder) for category‑agnostic instance segmentation using our instance coloring loss exclusively on a narrow set of object types (indoor furnishings and cars). Surprisingly, our models exhibit strong zero‑shot generalization, accurately segmenting objects of types and styles unseen in finetuning. This holds even for MAE, which is pretrained on unlabeled ImageNet‑1K only. When evaluated on unseen object types and styles, our best‑performing models closely approach the heavily supervised SAM, and outperform it when segmenting fine structures and ambiguous boundaries. In contrast, existing promptable segmentation architectures or discriminatively pretrained models fail to generalize. This suggests that generative models learn an inherent grouping mechanism that transfers across categories and domains, even without internet‑scale pretraining. Please see our website for additional qualitative figures, code, and a demo.

Abstract:
Semantic segmentation (SS) of remote sensing images (RSIs) enables the fine‑grained interpretation of surface features, making it a critical task in remote sensing (RS) analysis. With the increasing diversity and volume of RSIs collected by sensors on various platforms, traditional processing methods struggle to maintain efficiency and accuracy. In response, deep learning (DL) has emerged as a transformative approach, enabling substantial advances in remote sensing image semantic segmentation (RSISS) by automating hierarchical feature extraction and improving segmentation performance across diverse modalities. As data scale and model capacity have increased, DL‑based RSISS has undergone a structural evolution from pixel‑level and patch‑based classification to tile‑level, end‑to‑end segmentation, and, more recently, to image‑level modelling with vision foundation models. However, existing reviews often focus on individual components, such as supervision strategies or fusion stages, and lack a unified operational perspective aligned with segmentation granularity and the training/inference pipeline. This paper provides a comprehensive review by organizing DL‑based RSISS into a pixel‑patch‑tile‑image hierarchy, covering early pixel‑based methods, prevailing patch‑based and tile‑based techniques, and emerging image‑based approaches. This review offers a holistic and structured understanding of DL‑based RSISS, highlighting representative datasets, comparative insights, and open challenges related to data scale, model efficiency, domain robustness, and multimodal integration. Furthermore, to facilitate reproducible research, curated code collections are provided at: https://github.com/quanweiliu/PatchwiseClsFra and https://github.com/quanweiliu/TilewiseSegFra.

Abstract:
This paper introduces ReservoirTTA, a novel plug‑in framework designed for prolonged test‑time adaptation (TTA) in scenarios where the test domain continuously shifts over time, including cases where domains recur or evolve gradually. At its core, ReservoirTTA maintains a reservoir of domain‑specialized models (an adaptive test‑time model ensemble) that both detects new domains via online clustering over style features of incoming samples and routes each sample to the appropriate specialized model, thereby enabling domain‑specific adaptation. This multi‑model strategy overcomes key limitations of single‑model adaptation, such as catastrophic forgetting, inter‑domain interference, and error accumulation, ensuring robust and stable performance on sustained non‑stationary test distributions. Our theoretical analysis reveals key components that bound parameter variance and prevent model collapse, while our plug‑in TTA module mitigates catastrophic forgetting of previously encountered domains. Extensive experiments on scene‑level corruption benchmarks (ImageNet‑C, CIFAR‑10/100‑C), object‑level style shifts (DomainNet‑126, PACS), and semantic segmentation (Cityscapes→ACDC), covering recurring and continuously evolving domain shifts, show that ReservoirTTA substantially improves adaptation accuracy and maintains stable performance across prolonged, recurring shifts, outperforming state‑of‑the‑art methods. Our code is publicly available at https://github.com/LTS5/ReservoirTTA.

Abstract:
This paper focuses on few‑shot object detection (FSOD) and instance segmentation (FSIS), which require a model to quickly adapt to novel classes with a few labeled instances. Existing methods suffer severely from biased classification because of the missing‑label issue, which naturally exists in instance‑level few‑shot scenarios and is first formally proposed in this work. Our analysis suggests that the standard classification head of most FSOD or FSIS models needs to be decoupled to mitigate the biased classification. Therefore, we propose an embarrassingly simple but effective method that decouples the standard classifier into two heads. These two individual heads are then capable of independently addressing clear positive samples and the noisy negative samples caused by missing labels. In this way, the model can effectively learn novel classes while mitigating the effects of noisy negative samples. Without bells and whistles, and without any additional computation cost or parameters, our model consistently outperforms its baseline and the state of the art by a large margin on the PASCAL VOC and MS‑COCO benchmarks for FSOD and FSIS tasks. The code is available at https://csgaobb.github.io/Projects/DCFS.

Abstract:
Knowledge distillation (KD) is a valuable technique for compressing large deep learning models into smaller, edge‑suitable networks. However, conventional KD frameworks rely on pre‑trained high‑capacity teacher networks, which introduce significant challenges such as increased memory/storage requirements, additional training costs, and ambiguity in selecting an appropriate teacher for a given student model. Although a teacher‑free distillation (self‑distillation) has emerged as a promising alternative, many existing approaches still rely on architectural modifications or complex training procedures, which limit their generality and efficiency. To address these limitations, we propose a novel framework based on teacher‑free distillation that operates using a single student network without any auxiliary components, architectural modifications, or additional learnable parameters. Our approach is built on a simple yet highly effective augmentation, called intra‑class patch swap augmentation. This augmentation simulates a teacher‑student dynamic within a single model by generating pairs of intra‑class samples with varying confidence levels, and then applying instance‑to‑instance distillation to align their predictive distributions. Our method is conceptually simple, model‑agnostic, and easy to implement, requiring only a single augmentation function. Extensive experiments across image classification, semantic segmentation, and object detection show that our method consistently outperforms both existing self‑distillation baselines and conventional teacher‑based KD approaches. These results suggest that the success of self‑distillation could hinge on the design of the augmentation itself. Our codes are available at https://github.com/hchoi71/Intra‑class‑Patch‑Swap.

Abstract:
Referring video object segmentation (RVOS) aims to identify, track and segment the objects in a video based on language descriptions, and has received great attention in recent years. However, existing datasets remain focused on short video clips of only a few seconds, with salient objects visible in most frames. To advance the task towards more practical scenarios, we introduce Long‑RVOS, a large‑scale benchmark for long‑term referring video object segmentation. Long‑RVOS contains 2,000+ videos with an average duration exceeding 60 seconds, covering a variety of objects that undergo occlusion, disappearance‑reappearance and shot changes. The objects are manually annotated with three different types of descriptions to individually evaluate the understanding of static attributes, motion patterns and spatiotemporal relationships. Moreover, unlike previous benchmarks that rely solely on per‑frame spatial evaluation, we introduce two new metrics to assess temporal and spatiotemporal consistency. We benchmark 6 state‑of‑the‑art methods on Long‑RVOS. The results show that current approaches struggle severely with the long‑video challenges. To address this, we further propose ReferMo, a promising baseline method that integrates motion information to expand the temporal receptive field, and employs a local‑to‑global architecture to capture both short‑term dynamics and long‑term dependencies. Despite its simplicity, ReferMo achieves significant improvements over current methods in long‑term scenarios. We hope that Long‑RVOS and our baseline can drive future RVOS research towards tackling more realistic and long‑form videos.

Abstract:
The proliferation of multi‑source remote sensing data has propelled the development of deep learning for dense prediction, yet significant challenges in data and task unification persist. Current deep learning architectures for remote sensing are fundamentally rigid. They are engineered for fixed input‑output configurations, restricting their adaptability to the heterogeneous spatial, temporal, and spectral dimensions inherent in real‑world data. Furthermore, these models neglect the intrinsic correlations among semantic segmentation, binary change detection, and semantic change detection, necessitating the development of distinct models or task‑specific decoders. This paradigm is also constrained to a predefined set of output semantic classes, where any change to the classes requires costly retraining. To overcome these limitations, we introduce the Spatial‑Temporal‑Spectral Unified Network (STSUN) for unified modeling. STSUN can adapt to input and output data with arbitrary spatial sizes, temporal lengths, and spectral bands by leveraging their metadata for a unified representation. Moreover, STSUN unifies disparate dense prediction tasks within a single architecture by conditioning the model on trainable task embeddings. Similarly, STSUN facilitates flexible prediction across multiple sets of semantic categories by integrating trainable category embeddings as metadata. Extensive experiments on multiple datasets with diverse spatial‑temporal‑spectral configurations in multiple scenarios demonstrate that a single STSUN model effectively adapts to heterogeneous inputs and outputs, unifying various dense prediction tasks and diverse semantic class predictions. The proposed approach consistently achieves state‑of‑the‑art performance, highlighting its robustness and generalizability for complex remote sensing applications.

Abstract:
Spiking Neural Networks (SNNs) have shown competitive performance to Artificial Neural Networks (ANNs) in various vision tasks, while offering superior energy efficiency. However, existing SNN‑based Transformers primarily focus on single‑image tasks, emphasizing spatial features while not effectively leveraging SNNs' efficiency in video‑based vision tasks. In this paper, we introduce SpikeVideoFormer, an efficient spike‑driven video Transformer, featuring linear temporal complexity $\mathcal{O}(T)$. Specifically, we design a spike‑driven Hamming attention (SDHA) which provides a theoretically guided adaptation from traditional real‑valued attention to spike‑driven attention. Building on SDHA, we further analyze various spike‑driven space‑time attention designs and identify an optimal scheme that delivers appealing performance for video tasks, while maintaining only linear temporal complexity. The generalization ability and efficiency of our model are demonstrated across diverse downstream video tasks, including classification, human pose tracking, and semantic segmentation. Empirical results show our method achieves state‑of‑the‑art (SOTA) performance compared to existing SNN approaches, with over 15% improvement on the latter two tasks. Additionally, it matches the performance of recent ANN‑based methods while offering significant efficiency gains, achieving 16×, 10×, and 5× improvements on the three tasks. https://github.com/JimmyZou/SpikeVideoFormer

Abstract:
Airborne laser scanning (ALS) point cloud semantic segmentation is a fundamental task for large‑scale 3D scene understanding. Fixed models deployed in real‑world scenarios often suffer from performance degradation due to continuous domain shifts caused by environmental and sensor changes. Continuous Test‑Time Adaptation (CTTA) enables adaptation to evolving unlabeled domains, but its application to ALS point clouds remains underexplored, hindered by the lack of benchmarks and the risks of catastrophic forgetting and error accumulation. To address these challenges, we propose APCoTTA (ALS Point cloud Continuous Test‑Time Adaptation), a novel CTTA framework tailored for ALS point cloud semantic segmentation. APCoTTA consists of three key components. First, we adapt a gradient‑driven layer selection mechanism for ALS point clouds, selectively updating low‑confidence layers while freezing stable ones to preserve source knowledge and mitigate catastrophic forgetting. Second, an entropy‑based consistency loss discards unreliable samples and enforces consistency regularization solely on reliable ones, effectively reducing error accumulation and improving adaptation stability. Third, a random parameter interpolation mechanism stochastically blends adapted parameters with source model parameters, further balancing target adaptation and source knowledge retention. Finally, we construct two benchmarks, ISPRSC and H3DC, to address the lack of CTTA benchmarks for ALS point cloud segmentation. Extensive experiments demonstrate that APCoTTA achieves superior performance on both benchmarks, improving mIoU by approximately 9% and 14% over direct inference. The new benchmarks and code are available at https://github.com/Gaoyuan2/APCoTTA.
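The third component above, random parameter interpolation, can be illustrated with a small sketch: each adapted parameter is, with some probability, pulled back toward its source value by a convex combination. The selection probability `p`, the blend weight `alpha`, per-element selection, and the name `random_interpolate` are assumptions made for this sketch; the abstract does not give the exact rule.

```python
import random

def random_interpolate(adapted, source, p=0.1, alpha=0.5, rng=None):
    """Stochastically blend adapted parameters with source parameters.

    adapted, source: flat lists of floats with equal length.
    p:     probability that a given parameter is blended (assumed value).
    alpha: weight on the source value when blending (assumed value).
    """
    rng = rng or random.Random()
    blended = []
    for a, s in zip(adapted, source):
        if rng.random() < p:
            # Pull this parameter back toward the source model.
            blended.append(alpha * s + (1.0 - alpha) * a)
        else:
            # Keep the adapted value unchanged.
            blended.append(a)
    return blended

source_params = [0.0, 1.0, 2.0, 3.0]
adapted_params = [0.4, 1.2, 1.6, 3.8]
print(random_interpolate(adapted_params, source_params, p=0.5, rng=random.Random(0)))
```

With `p=0` the adapted model is kept as is, and with `p=1, alpha=1` the source model is fully restored, so the two parameters span the range between pure adaptation and pure retention.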

Abstract:
Zero‑ and few‑shot visual anomaly segmentation relies on powerful vision‑language models that detect unseen anomalies using manually designed textual prompts. However, visual representations are inherently independent of language. In this paper, we explore the potential of a pure visual foundation model as an alternative to widely used vision‑language models for universal visual anomaly segmentation. We present a novel paradigm that unifies anomaly segmentation into change segmentation. This paradigm enables us to leverage large‑scale synthetic image pairs, featuring object‑level and local region changes, derived from existing image datasets, which are independent of target anomaly datasets. We propose a one‑prompt Meta‑learning framework for Universal Anomaly Segmentation (MetaUAS) that is trained on this synthetic dataset and then generalizes well to segment any novel or unseen visual anomalies in the real world. To handle geometrical variations between prompt and query images, we propose a soft feature alignment module that bridges paired‑image change perception and single‑image semantic segmentation. This is the first work to achieve universal anomaly segmentation using a pure vision model without relying on special anomaly detection datasets or pre‑trained visual‑language models. Our method effectively and efficiently segments any anomalies with only one normal image prompt and is training‑free, requiring no guidance from language. Our MetaUAS significantly outperforms previous zero‑shot, few‑shot, and even full‑shot anomaly segmentation methods. The code and pre‑trained models are available at https://github.com/gaobb/MetaUAS.

Abstract:
Diffusion‑based image super‑resolution (SR) methods have demonstrated remarkable performance. Recent advancements have introduced deterministic sampling processes that reduce inference from 15 iterative steps to a single step, thereby significantly improving the inference speed of existing diffusion models. However, their efficiency remains limited when handling complex semantic regions due to the single‑step inference. To address this limitation, we propose SAMSR, a semantic‑guided diffusion framework that incorporates semantic segmentation masks into the sampling process. Specifically, we introduce the SAM‑Noise Module, which refines Gaussian noise using segmentation masks to preserve spatial and semantic features. Furthermore, we develop a pixel‑wise sampling strategy that dynamically adjusts the residual transfer rate and noise strength based on pixel‑level semantic weights, prioritizing semantically rich regions during the diffusion process. To enhance model training, we also propose a semantic consistency loss, which aligns pixel‑wise semantic weights between predictions and ground truth. Extensive experiments on both real‑world and synthetic datasets demonstrate that SAMSR significantly improves perceptual quality and detail recovery, particularly in semantically complex images. Our code is released at https://github.com/Liu‑Zihang/SAMSR.

Abstract:
Prompt engineering has shown remarkable success with large language models, yet its systematic exploration in computer vision remains limited. In semantic segmentation, both textual and visual prompts offer distinct advantages: textual prompts through open‑vocabulary methods allow segmentation of arbitrary categories, while visual reference prompts provide intuitive reference examples. However, existing benchmarks evaluate these modalities in isolation, without direct comparison under identical conditions. We present Show or Tell (SoT), a novel benchmark specifically designed to evaluate both visual and textual prompts for semantic segmentation across 14 datasets spanning 7 diverse domains (common scenes, urban, food, waste, parts, tools, and land‑cover). We evaluate 5 open‑vocabulary methods and 4 visual reference prompt approaches, adapting the latter to handle multi‑class segmentation through a confidence‑based mask merging strategy. Our extensive experiments reveal that open‑vocabulary methods excel with common concepts easily described by text but struggle with complex domains like tools, while visual reference prompt methods achieve good average results but exhibit high variability depending on the input prompt. Through comprehensive quantitative and qualitative analysis, we identify the strengths and weaknesses of both prompting modalities, providing valuable insights to guide future research in vision foundation models for segmentation tasks.
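
The confidence‑based mask merging mentioned above can be illustrated with a minimal sketch. The abstract does not specify the exact rule, so the following is only an assumption: each class contributes one binary mask with a single confidence score, and every pixel takes the label of the most confident mask that covers it. The function name `merge_masks` and the background label 0 are hypothetical.

```python
def merge_masks(masks, scores, background=0):
    """Merge per-class binary masks into one label map.

    masks:  dict class_id -> 2D list of 0/1 (all the same shape)
    scores: dict class_id -> confidence of that class's mask
    A pixel covered by several masks takes the class with the highest score;
    a pixel covered by none keeps the background label.
    """
    first = next(iter(masks.values()))
    h, w = len(first), len(first[0])
    labels = [[background] * w for _ in range(h)]
    best = [[float("-inf")] * w for _ in range(h)]
    for cls, mask in masks.items():
        s = scores[cls]
        for y in range(h):
            for x in range(w):
                if mask[y][x] and s > best[y][x]:
                    best[y][x] = s
                    labels[y][x] = cls
    return labels


masks = {1: [[1, 1], [0, 0]], 2: [[0, 1], [0, 1]]}
scores = {1: 0.6, 2: 0.9}
merged = merge_masks(masks, scores)
```

In this toy input the top‑right pixel is claimed by both classes and goes to class 2, whose mask has the higher score.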

Abstract:
Cotton is a major cash crop in the United States, with the country being a leading global producer and exporter. Nearly all U.S. cotton is grown in the Cotton Belt, spanning 17 states in the southern region. Harvesting remains a critical yet challenging stage, impacted by the use of costly, environmentally harmful defoliants and heavy, expensive cotton pickers. These factors contribute to yield loss, reduced fiber quality, and soil compaction, which collectively threaten long‑term sustainability. To address these issues, this study proposes a lightweight, small‑scale, vision‑guided autonomous robotic cotton picker as an alternative. An autonomous system, built on Clearpath's Husky platform and integrated with the CottonEye perception system, was developed and tested in the Gazebo simulation environment. A virtual cotton field was designed to facilitate autonomous navigation testing. The navigation system used Global Positioning System (GPS) and map‑based guidance, assisted by an RGB‑depth camera and a YOLOv8n‑seg instance segmentation model. The model achieved a mean Average Precision (mAP) of 85.2%, a recall of 88.9%, and a precision of 93.0%. The GPS‑based approach reached a 100% completion rate (CR) within a 5e‑6° threshold, while the map‑based method achieved a 96.7% CR within a 0.25 m threshold. The developed Robot Operating System (ROS) packages enable robust simulation of autonomous cotton picking, offering a scalable baseline for future agricultural robotics. CottonSim code and datasets are publicly available on GitHub: https://github.com/imtheva/CottonSim

Abstract:
Deep learning (DL) models are widely used in real‑world applications but remain vulnerable to distribution shifts, especially due to weather and lighting changes. Collecting diverse real‑world data for testing the robustness of DL models is resource‑intensive, making synthetic corruptions an attractive alternative for robustness testing. However, are synthetic corruptions a reliable proxy for real‑world corruptions? To answer this, we conduct the largest benchmarking study on semantic segmentation models, comparing performance on real‑world corruptions and synthetic corruptions datasets. Our results reveal a strong correlation in mean performance, supporting the use of synthetic corruptions for robustness evaluation. We further analyze corruption‑specific correlations, providing key insights to understand when synthetic corruptions succeed in representing real‑world corruptions. Open‑source Code: https://github.com/shashankskagnihotri/benchmarking_robustness/tree/segmentation_david/semantic_segmentation

Abstract:
Dense visual prediction tasks have been constrained by their reliance on predefined categories, limiting their applicability in real‑world scenarios where visual concepts are unbounded. While Vision‑Language Models (VLMs) like CLIP have shown promise in open‑vocabulary tasks, their direct application to dense prediction often leads to suboptimal performance due to limitations in local feature representation. In this work, we present our observation that CLIP's image tokens struggle to effectively aggregate information from spatially or semantically related regions, resulting in features that lack local discriminability and spatial consistency. To address this issue, we propose DeCLIP, a novel framework that enhances CLIP by decoupling the self‑attention module to obtain "content" and "context" features respectively. The "content" features are aligned with image crop representations to improve local discriminability, while "context" features learn to retain the spatial correlations under the guidance of vision foundation models, such as DINO. Extensive experiments demonstrate that DeCLIP significantly outperforms existing methods across multiple open‑vocabulary dense prediction tasks, including object detection and semantic segmentation. Code is available at https://github.com/xiaomoguhz/DeCLIP.

Abstract:
Panoramic imaging enables capturing 360° images with an ultra‑wide Field‑of‑View (FoV) for dense omnidirectional perception, which is critical to applications such as autonomous driving and augmented reality. However, current panoramic semantic segmentation methods fail to identify outliers, and pinhole Out‑of‑distribution Segmentation (OoS) models perform unsatisfactorily in the panoramic domain due to pixel distortions and background clutter. To address these issues, we introduce a new task, Panoramic Out‑of‑distribution Segmentation (PanOoS), with the aim of achieving comprehensive and safe scene understanding. Furthermore, we propose the first solution, POS, which adapts to the characteristics of panoramic images through text‑guided prompt distribution learning. Specifically, POS integrates a disentanglement strategy designed to materialize the cross‑domain generalization capability of CLIP. The proposed Prompt‑based Restoration Attention (PRA) optimizes semantic decoding by prompt guidance and self‑adaptive correction, while Bilevel Prompt Distribution Learning (BPDL) refines the manifold of per‑pixel mask embeddings via semantic prototype supervision. In addition, to compensate for the scarcity of PanOoS datasets, we establish two benchmarks: DenseOoS, which features diverse outliers in complex environments, and QuadOoS, captured by a quadruped robot with a panoramic annular lens system. Extensive experiments demonstrate superior performance of POS, with AuPRC improving by 34.25% and FPR95 decreasing by 21.42% on DenseOoS, outperforming state‑of‑the‑art pinhole‑OoS methods. Moreover, POS achieves leading closed‑set segmentation capabilities and advances the development of panoramic understanding. Code and datasets will be available at https://github.com/MengfeiD/PanOoS.

Abstract:
Event cameras capture microsecond‑level motion cues that complement RGB sensors. However, the prevailing paradigm of treating RGB‑Event perception as a fusion problem is ill‑posed, as it ignores the intrinsic (i) Spatiotemporal and (ii) Modal Misalignment, unlike other RGB‑X sensing domains. To tackle these limitations, we recast RGB‑Event segmentation from fusion to registration. We propose BRENet, a novel flow‑guided bidirectional framework that adaptively matches correspondence between the asymmetric modalities. Specifically, it leverages temporally aligned optical flows as a coarse‑grained guide, along with fine‑grained event temporal features, to generate precise forward and backward pixel pairings for registration. This pairing mechanism converts the inherent motion lag into terms governed by flow estimation error, bridging modality gaps. Moreover, we introduce Motion‑Enhanced Event Tensor (MET), a new representation that transforms sparse event streams into a dense, temporally coherent form. Extensive experiments on four large‑scale datasets validate our approach, establishing flow‑guided registration as a promising direction for RGB‑Event segmentation. Our code is available at: https://github.com/zyaocoder/BRENet.

Abstract:
Camouflaged object segmentation presents unique challenges compared to traditional segmentation tasks, primarily due to the high similarity in patterns and colors between camouflaged objects and their backgrounds. Effective solutions to this problem have significant implications in critical areas such as pest control, defect detection, and lesion segmentation in medical imaging. Prior research has predominantly emphasized supervised or unsupervised pre‑training methods, leaving zero‑shot approaches significantly underdeveloped. Existing zero‑shot techniques commonly utilize the Segment Anything Model (SAM) in automatic mode or rely on vision‑language models to generate cues for segmentation; however, their performances remain unsatisfactory, due to the similarity of the camouflaged object and the background. This work studies how to avoid training by integrating large pre‑trained models like SAM‑2 and Owl‑v2 with temporal information into a modular pipeline. Evaluated on the MoCA‑Mask dataset, our approach achieves outstanding performance improvements, significantly outperforming existing zero‑shot methods by raising the F‑measure (F_β^w) from 0.296 to 0.628. Our approach also surpasses supervised methods, increasing the F‑measure from 0.476 to 0.628. Additionally, evaluation on the MoCA‑Filter dataset demonstrates an increase in the success rate from 0.628 to 0.697 when compared with FlowSAM, a supervised transfer method. A thorough ablation study further validates the individual contributions of each component. Besides our main contributions, we also highlight inconsistencies in previous work regarding metrics and settings. Code can be found in https://github.com/weathon/vcos.

Abstract:
This paper addresses the challenge of mapping polygonal buildings from remote sensing images and introduces a novel algorithm, the Global Collinearity‑aware Polygonizer (GCP). GCP, built upon an instance segmentation framework, processes binary masks produced by any instance segmentation model. The algorithm begins by collecting polylines sampled along the contours of the binary masks. These polylines undergo a refinement process using a transformer‑based regression module to ensure they accurately fit the contours of the targeted building instances. Subsequently, a collinearity‑aware polygon simplification module simplifies these refined polylines and generates the final polygon representation. This module employs a dynamic programming technique to optimize an objective function that balances the simplicity and fidelity of the polygons, achieving globally optimal solutions. Furthermore, the optimized collinearity‑aware objective is seamlessly integrated into network training, enhancing the cohesiveness of the entire pipeline. The effectiveness of GCP has been validated on two public benchmarks for polygonal building mapping. Further experiments reveal that applying the collinearity‑aware polygon simplification module to arbitrary polylines, without prior knowledge, enhances accuracy over traditional methods such as the Douglas‑Peucker algorithm. This finding underscores the broad applicability of GCP. The code for the proposed method will be made available at https://github.com/zhu‑xlab.
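
For context, the Douglas‑Peucker baseline named above can be sketched in a few lines. This is the traditional greedy simplifier that GCP is compared against, not the GCP module itself; the point‑to‑segment distance variant and the tolerance `epsilon` are choices made for this illustration.

```python
import math


def _point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Projection parameter of p onto the segment, clamped to [0, 1].
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def douglas_peucker(points, epsilon):
    """Recursively drop vertices that lie within epsilon of the chord."""
    if len(points) < 3:
        return list(points)
    a, b = points[0], points[-1]
    idx, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = _point_segment_distance(points[i], a, b)
        if d > dmax:
            idx, dmax = i, d
    if dmax <= epsilon:
        return [a, b]
    left = douglas_peucker(points[: idx + 1], epsilon)
    right = douglas_peucker(points[idx:], epsilon)
    return left[:-1] + right  # avoid duplicating the split vertex


contour = [(0, 0), (1, 0.05), (2, 0), (2, 1), (2, 2)]
simplified = douglas_peucker(contour, epsilon=0.1)
```

Each recursive split is decided locally from the single farthest vertex, which is the greedy behaviour that a globally optimized objective such as the one described in the abstract is meant to improve on.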

Abstract:
The increasing accessibility of remotely sensed data and their potential to support large‑scale decision‑making have driven the development of deep learning models for many Earth Observation tasks. Traditionally, such models rely on large datasets. However, the common assumption that larger training datasets lead to better performance tends to overlook issues related to data redundancy, noise, and the computational cost of processing massive datasets. Effective solutions must therefore consider not only the quantity but also the quality of data. Towards this, in this paper, we introduce six basic core‑set selection approaches ‑‑ that rely on imagery only, labels only, or a combination of both ‑‑ and investigate whether they can identify high‑quality subsets of data capable of maintaining ‑‑ or even surpassing ‑‑ the performance achieved when using full datasets for remote sensing semantic segmentation. We benchmark such approaches against two traditional baselines on three widely used land‑cover classification datasets (DFC2022, Vaihingen, and Potsdam) using two different architectures (SegFormer and U‑Net), thus establishing a general baseline for future works. Our experiments show that all proposed methods consistently outperform the baselines across multiple subset sizes, with some approaches even selecting core sets that surpass training on all available data. Notably, on DFC2022, a selected subset comprising only 25% of the training data yields slightly higher SegFormer performance than training with the entire dataset. This result shows the importance and potential of data‑centric learning for the remote sensing domain. The code is available at https://github.com/keillernogueira/data‑centric‑rs‑classification/.

Abstract:
Deep learning has profoundly transformed remote sensing, yet prevailing architectures like Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) remain constrained by critical trade‑offs: CNNs suffer from limited receptive fields, while ViTs grapple with quadratic computational complexity, hindering their scalability for high‑resolution remote sensing data. State Space Models (SSMs), particularly the recently proposed Mamba architecture, have emerged as a paradigm‑shifting solution, combining linear computational scaling with global context modeling. This survey presents a comprehensive review of Mamba‑based methodologies in remote sensing, systematically analyzing about 120 Mamba‑based remote sensing studies to construct a holistic taxonomy of innovations and applications. Our contributions are structured across five dimensions: (i) foundational principles of vision Mamba architectures, (ii) micro‑architectural advancements such as adaptive scan strategies and hybrid SSM formulations, (iii) macro‑architectural integrations, including CNN‑Transformer‑Mamba hybrids and frequency‑domain adaptations, (iv) rigorous benchmarking against state‑of‑the‑art methods in multiple application tasks, such as object detection, semantic segmentation, and change detection, and (v) critical analysis of unresolved challenges with actionable future directions. By bridging the gap between SSM theory and remote sensing practice, this survey establishes Mamba as a transformative framework for remote sensing analysis. To our knowledge, this paper is the first systematic review of Mamba architectures in remote sensing. Our work provides a structured foundation for advancing research in remote sensing systems through SSM‑based methods. We curate an open‑source repository (https://github.com/BaoBao0926/Awesome‑Mamba‑in‑Remote‑Sensing) to foster community‑driven advancements.

Abstract:
Open‑vocabulary 3D panoptic segmentation has recently emerged as a significant trend. Top‑performing methods currently integrate 2D segmentation with geometry‑aware 3D primitives. However, the advantage would be lost without high‑fidelity 3D point clouds, such as methods based on Neural Radiance Field (NeRF). These methods are limited by the insufficient capacity to maintain consistency across partial observations. To address this, recent works have utilized contrastive loss or cross‑view association pre‑processing for view consensus. In contrast to them, we present Cues3D, a compact approach that relies solely on NeRF instead of pre‑associations. The core idea is that NeRF's implicit 3D field inherently establishes a globally consistent geometry, enabling effective object distinction without explicit cross‑view supervision. We propose a three‑phase training framework for NeRF, initialization‑disambiguation‑refinement, whereby the instance IDs are corrected using the initially‑learned knowledge. Additionally, an instance disambiguation method is proposed to match NeRF‑rendered 3D masks and ensure globally unique 3D instance identities. With the aid of Cues3D, we obtain highly consistent and unique 3D instance IDs for each object across views with a balanced version of NeRF. Our experiments are conducted on ScanNet v2, ScanNet200, ScanNet++, and Replica datasets for 3D instance, panoptic, and semantic segmentation tasks. Cues3D outperforms other 2D image‑based methods and competes with the latest 2D‑3D merging based methods, while even surpassing them when using additional 3D point clouds. The code link can be found in the appendix, and the code will be released on GitHub: https://github.com/mRobotit/Cues3D

Abstract:
Numerous recent works emphasize the importance of semantic segmentation of LiDAR data as a critical component in the development of driver‑assistance systems and autonomous vehicles. However, many state‑of‑the‑art methods are tested on outdated, lower‑resolution LiDAR sensors and struggle with real‑time constraints. This study introduces a novel semantic segmentation framework tailored for modern high‑resolution LiDAR sensors that addresses both accuracy and real‑time processing demands. We propose a novel LiDAR dataset collected by a cutting‑edge automotive 128‑layer LiDAR in urban traffic scenes. Furthermore, we propose a semantic segmentation method utilizing surface normals as strong input features. Our approach bridges the gap between cutting‑edge research and practical automotive applications. Additionally, we provide a Robot Operating System (ROS2) implementation that we operate on our research vehicle. Our dataset and code are publicly available: https://github.com/kav‑institute/SemanticLiDAR.

Abstract:
We propose a result‑level category‑specific fusion architecture called ClassWise‑CRF. This architecture employs a two‑stage process: first, it selects expert networks that perform well in specific categories from a pool of candidate networks using a greedy algorithm; second, it integrates the segmentation predictions of these selected networks by adaptively weighting their contributions based on their segmentation performance in each category. Inspired by Conditional Random Field (CRF), the ClassWise‑CRF architecture treats the segmentation predictions from multiple networks as confidence vector fields. It leverages segmentation metrics (such as Intersection over Union) from the validation set as priors and employs an exponential weighting strategy to fuse the category‑specific confidence scores predicted by each network. This fusion method dynamically adjusts the weights of each network for different categories, achieving category‑specific optimization. Building on this, the architecture further optimizes the fused results using unary and pairwise potentials in CRF to ensure spatial consistency and boundary accuracy. To validate the effectiveness of ClassWise‑CRF, we conducted experiments on two remote sensing datasets, LoveDA and Vaihingen, using eight classic and advanced semantic segmentation networks. The results show that the ClassWise‑CRF architecture significantly improves segmentation performance: on the LoveDA dataset, the mean Intersection over Union (mIoU) metric increased by 1.00% on the validation set and by 0.68% on the test set; on the Vaihingen dataset, the mIoU improved by 0.87% on the validation set and by 0.91% on the test set. These results fully demonstrate the effectiveness and generality of the ClassWise‑CRF architecture in semantic segmentation of remote sensing images. The full code is available at https://github.com/zhuqinfeng1999/ClassWise‑CRF.
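
The exponential weighting step described above can be illustrated for a single pixel. The abstract does not give the formula, so the form below is an assumption: network n receives weight exp(alpha * IoU[n][c]) for class c, and the fused confidence is the weighted average over networks. The function name `fuse_pixel` and the temperature `alpha` are hypothetical, and the CRF refinement stage is omitted.

```python
import math


def fuse_pixel(probs, val_iou, alpha=5.0):
    """Fuse per-class confidences from several networks at one pixel.

    probs[n][c]:   confidence of network n for class c
    val_iou[n][c]: validation IoU of network n on class c (used as a prior)
    alpha:         assumed sharpness of the exponential weighting
    """
    n_classes = len(probs[0])
    fused = []
    for c in range(n_classes):
        # Networks that did better on class c during validation count more.
        weights = [math.exp(alpha * iou[c]) for iou in val_iou]
        total = sum(weights)
        fused.append(sum(w * p[c] for w, p in zip(weights, probs)) / total)
    return fused


probs = [[0.6, 0.4], [0.3, 0.7]]      # two networks, two classes
val_iou = [[0.9, 0.2], [0.5, 0.8]]    # network 0 is the class-0 expert
fused = fuse_pixel(probs, val_iou)
```

Because the weights are computed per class, a network that is strong on one category and weak on another influences only the scores where its validation IoU is high.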

Abstract:
Table structure recognition is a key task in document analysis. However, the geometric deformation in deformed tables causes a weak correlation between content information and structure, resulting in downstream tasks not being able to obtain accurate content information. To obtain fine‑grained spatial coordinates of cells, we propose the OG‑HFYOLO model, which enhances the edge response by Gradient Orientation‑aware Extractor, combines a Heterogeneous Kernel Cross Fusion module and a scale‑aware loss function to adapt to multi‑scale objective features, and introduces mask‑driven non‑maximal suppression in the post‑processing, which replaces the traditional bounding box suppression mechanism. Furthermore, we also propose a data generator, filling the gap in the dataset for fine‑grained deformation table cell spatial coordinate localization, and derive a large‑scale dataset named Deformation Wired Table (DWTAL). Experiments show that our proposed model demonstrates excellent segmentation accuracy on all mainstream instance segmentation models. The dataset and the source code are open source: https://github.com/justliulong/OGHFYOLO.

Abstract:
By mapping sites at large scales using remotely sensed data, archaeologists can generate unique insights into long‑term demographic trends, inter‑regional social networks, and past adaptations to climate change. Remote sensing surveys complement field‑based approaches, and their reach can be especially great when combined with deep learning and computer vision techniques. However, conventional supervised deep learning methods face challenges in annotating fine‑grained archaeological features at scale. While recent vision foundation models have shown remarkable success in learning large‑scale remote sensing data with minimal annotations, most off‑the‑shelf solutions are designed for RGB images rather than multi‑spectral satellite imagery, such as the 8‑band data used in our study. In this paper, we introduce DeepAndes, a transformer‑based vision foundation model trained on three million multi‑spectral satellite images, specifically tailored for Andean archaeology. DeepAndes incorporates a customized DINOv2 self‑supervised learning algorithm optimized for 8‑band multi‑spectral imagery, marking the first foundation model designed explicitly for the Andes region. We evaluate its image understanding performance through imbalanced image classification, image instance retrieval, and pixel‑level semantic segmentation tasks. Our experiments show that DeepAndes achieves superior F1 scores, mean average precision, and Dice scores in few‑shot learning scenarios, significantly outperforming models trained from scratch or pre‑trained on smaller datasets. This underscores the effectiveness of large‑scale self‑supervised pre‑training in archaeological remote sensing. Codes will be available on https://github.com/geopacha/DeepAndes.

Abstract:
Understanding continuous video streams plays a fundamental role in real‑time applications including embodied AI and autonomous driving. Unlike offline video understanding, streaming video understanding requires the ability to process video streams frame by frame, preserve historical information, and make low‑latency decisions. To address these challenges, our main contributions are three‑fold. (i) We develop a novel streaming video backbone, termed as StreamFormer, by incorporating causal temporal attention into a pre‑trained vision transformer. This enables efficient streaming video processing while maintaining image representation capability. (ii) To train StreamFormer, we propose to unify diverse spatial‑temporal video understanding tasks within a multitask visual‑language alignment framework. Hence, StreamFormer learns global semantics, temporal dynamics, and fine‑grained spatial relationships simultaneously. (iii) We conduct extensive experiments on online action detection, online video instance segmentation, and video question answering. StreamFormer achieves competitive results while maintaining efficiency, demonstrating its potential for real‑time applications.
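
The causal temporal attention in contribution (i) can be sketched as a masked softmax in which frame t attends only to frames at or before t. This is a generic illustration of causal masking, not StreamFormer's implementation; the function name and the plain score matrix are assumptions.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax over frame-to-frame scores with future frames masked.

    scores[t][s] is the similarity between the query of frame t and the key
    of frame s. Entries with s > t receive zero weight.
    """
    out = []
    for t, row in enumerate(scores):
        visible = row[: t + 1]                # frames 0..t only
        m = max(visible)                      # subtract max for stability
        exps = [math.exp(v - m) for v in visible]
        z = sum(exps)
        out.append([e / z for e in exps] + [0.0] * (len(row) - t - 1))
    return out


weights = causal_attention_weights([[0.0] * 3 for _ in range(3)])
```

Since no row depends on later frames, the weights for frame t never change as new frames arrive, which is what makes frame‑by‑frame processing possible.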

Abstract:
The long‑tail problem presents a significant challenge to the advancement of semantic segmentation in ultra‑high‑resolution (UHR) satellite imagery. While previous efforts in UHR semantic segmentation have largely focused on multi‑branch network architectures that emphasize multi‑scale feature extraction and fusion, they have often overlooked the importance of addressing the long‑tail issue. In contrast to prior UHR methods that focused on independent feature extraction, we emphasize data augmentation and multimodal feature fusion to alleviate the long‑tail problem. In this paper, we introduce SRMF, a novel framework for semantic segmentation in UHR satellite imagery. Our approach addresses the long‑tail class distribution by incorporating a multi‑scale cropping technique alongside a data augmentation strategy based on semantic reordering and resampling. To further enhance model performance, we propose a multimodal fusion‑based general representation knowledge injection method, which, for the first time, fuses text and visual features without the need for individual region text descriptions, extracting more robust features. Extensive experiments on the URUR, GID, and FBP datasets demonstrate that our method improves mIoU by 3.33%, 0.66%, and 0.98%, respectively, achieving state‑of‑the‑art performance. Code is available at: https://github.com/BinSpa/SRMF.git.

Abstract:
There has long been a belief that high‑level semantics learning can benefit various downstream computer vision tasks. However, in the low‑light image enhancement (LLIE) community, existing methods learn a brutal mapping between low‑light and normal‑light domains without considering the semantic information of different regions, especially in those extremely dark regions that suffer from severe information loss. To address this issue, we propose a new deep semantic prior‑guided framework (DeepSPG) based on Retinex image decomposition for LLIE to explore informative semantic knowledge via a pre‑trained semantic segmentation model and multimodal learning. Notably, we incorporate both image‑level semantic prior and text‑level semantic prior and thus formulate a multimodal learning framework with combinatorial deep semantic prior guidance for LLIE. Specifically, we incorporate semantic knowledge to guide the enhancement process via three designs: an image‑level semantic prior guidance by leveraging hierarchical semantic features from a pre‑trained semantic segmentation model; a text‑level semantic prior guidance by integrating natural language semantic constraints via a pre‑trained vision‑language model; a multi‑scale semantic‑aware structure that facilitates effective semantic feature incorporation. Eventually, our proposed DeepSPG demonstrates superior performance compared to state‑of‑the‑art methods across five benchmark datasets. The implementation details and code are publicly available at https://github.com/Wenyuzhy/DeepSPG.

Abstract:
While sugar beets are stored prior to processing, they lose sugar due to factors such as microorganisms present in adherent soil and excess vegetation. Their automated visual inspection promises to aid in quality assurance and thereby increase efficiency throughout the processing chain of sugar production. In this work, we present a novel high‑quality annotated dataset and two‑stage method for the detection, semantic segmentation and mass estimation of post‑harvest and post‑storage sugar beets in monocular RGB images. We conduct extensive ablation experiments for the detection of sugar beets and their fine‑grained semantic segmentation regarding damage, rot, soil adhesion and excess vegetation. For these tasks, we evaluate multiple image sizes, model architectures and encoders, as well as the influence of environmental conditions. Our experiments show an mAP50‑95 of 98.8 for sugar‑beet detection and an mIoU of 64.0 for the best‑performing segmentation model.

Abstract:
In semantic segmentation, the accuracy of models heavily depends on the high‑quality annotations. However, in many practical scenarios, such as medical imaging and remote sensing, obtaining true annotations is not straightforward and usually requires significant human labor. Relying on human labor often introduces annotation errors, including mislabeling, omissions, and inconsistency between annotators. In the case of remote sensing, differences in procurement time can lead to misaligned ground‑truth annotations. These label errors are not independently distributed, and instead usually appear in spatially connected regions where adjacent pixels are more likely to share the same errors. To address these issues, we propose an approximate Bayesian estimation based on a probabilistic model that assumes training data include label errors, incorporating the tendency for these errors to occur with spatial correlations between adjacent pixels. However, Bayesian inference for such spatially correlated discrete variables is notoriously intractable. To overcome this fundamental challenge, we introduce a novel class of probabilistic models, which we term the ELBO‑Computable Correlated Discrete Distribution (ECCD). By representing the discrete dependencies through a continuous latent Gaussian field with a Kac‑Murdock‑Szegö (KMS) structured covariance, our framework enables scalable and efficient variational inference for problems previously considered computationally prohibitive. Through experiments on multiple segmentation tasks, we confirm that leveraging the spatial correlation of label errors significantly improves performance. Notably, in specific tasks such as lung segmentation, the proposed method achieves performance comparable to training with clean labels under moderate noise levels. Code is available at https://github.com/pfnet‑research/Bayesian_SpatialCorr.
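
The Kac‑Murdock‑Szegö structure mentioned above has a standard form: entry (i, j) equals rho^|i-j|, and for |rho| < 1 the inverse is tridiagonal in closed form. The sketch below builds both for a 1D chain of n sites. How the paper applies this structure to 2D label fields is not stated in the abstract, so the 1D setting and the function names are assumptions.

```python
def kms_matrix(n, rho):
    """KMS covariance: correlation decays geometrically with distance."""
    return [[rho ** abs(i - j) for j in range(n)] for i in range(n)]


def kms_inverse(n, rho):
    """Closed-form tridiagonal inverse, valid for |rho| < 1 and n >= 2."""
    scale = 1.0 / (1.0 - rho * rho)
    inv = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Boundary sites have one neighbour, interior sites have two.
        inv[i][i] = scale * (1.0 if i in (0, n - 1) else 1.0 + rho * rho)
        if i + 1 < n:
            inv[i][i + 1] = inv[i + 1][i] = -rho * scale
    return inv


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


K = kms_matrix(4, 0.6)
product = matmul(kms_inverse(4, 0.6), K)  # should be close to the identity
```

The sparse inverse means quantities involving the precision matrix need only neighbouring sites, which is one plausible reason such a covariance keeps inference tractable.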

Abstract:
In RGB‑D semantic segmentation for indoor scenes, a key challenge is effectively integrating the rich color information from RGB images with the spatial distance information from depth images. However, most existing methods overlook the inherent differences in how RGB and depth images express information. Properly distinguishing the processing of RGB and depth images is essential to fully exploiting their unique and significant characteristics. To address this, we propose a novel heterogeneous dual‑branch framework called HDBFormer, specifically designed to handle these modality differences. For RGB images, which contain rich detail, we employ both a basic and detail encoder to extract local and global features. For the simpler depth images, we propose LDFormer, a lightweight hierarchical encoder that efficiently extracts depth features with fewer parameters. Additionally, we introduce the Modality Information Interaction Module (MIIM), which combines transformers with large kernel convolutions to interact global and local information across modalities efficiently. Extensive experiments show that HDBFormer achieves state‑of‑the‑art performance on the NYUDepthv2 and SUN‑RGBD datasets. The code is available at: https://github.com/Weishuobin/HDBFormer.

Abstract:
Semantic segmentation of SAR images has garnered significant attention in remote sensing due to the immunity of SAR sensors to cloudy weather and light conditions. Nevertheless, SAR imagery lacks detailed information and is plagued by significant speckle noise, rendering the annotation or segmentation of SAR images a formidable task. Recent efforts have resorted to annotating paired optical‑SAR images to generate pseudo‑labels through the utilization of an optical image segmentation network. However, these pseudo‑labels are laden with noise, leading to suboptimal performance in SAR image segmentation. In this study, we introduce a more precise method for generating pseudo‑labels by incorporating semi‑supervised learning alongside a novel image resolution alignment augmentation. Furthermore, we introduce a symmetric cross‑entropy loss to mitigate the impact of noisy pseudo‑labels. Additionally, a bag of training and testing tricks is utilized to generate better land‑cover mapping results. Our experiments on the GRSS data fusion contest indicate the effectiveness of the proposed method, which achieves first place. The code is available at https://github.com/StuLiu/DFC2025Track1.git.
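
The symmetric cross‑entropy loss referred to above is commonly written as alpha * CE + beta * RCE, where the reverse term swaps the roles of prediction and label and replaces log 0 with a finite constant. A minimal per‑pixel sketch follows; the values of `alpha`, `beta`, and `log_zero` are illustrative assumptions, not the settings used in the paper.

```python
import math


def symmetric_cross_entropy(pred, label, alpha=1.0, beta=1.0, log_zero=-4.0):
    """Symmetric cross-entropy for one pixel.

    pred:     list of class probabilities (sums to 1)
    label:    index of the (possibly noisy) pseudo-label class
    log_zero: finite stand-in for log(0) in the reverse term
    """
    eps = 1e-7
    p = [min(max(v, eps), 1.0) for v in pred]
    # Standard cross-entropy: -log p[label].
    ce = -math.log(p[label])
    # Reverse cross-entropy: -sum_k p[k] * log q[k] with one-hot q, so the
    # true class contributes log(1) = 0 and the others use log_zero.
    rce = -sum(p[k] * log_zero for k in range(len(p)) if k != label)
    return alpha * ce + beta * rce


loss = symmetric_cross_entropy([0.7, 0.2, 0.1], label=0)
```

The reverse term is bounded, so a confidently wrong pseudo‑label adds a limited penalty instead of the unbounded one produced by standard cross‑entropy alone.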

Abstract:
Object detection in satellite‑borne Synthetic Aperture Radar (SAR) imagery holds immense potential in tasks such as urban monitoring and disaster response. However, the inherent complexities of SAR data and the scarcity of annotations present significant challenges in the advancement of object detection in this domain. Notably, the detection of small objects in satellite‑borne SAR images poses a particularly intricate problem, because of the technology's relatively low spatial resolution and inherent noise. Furthermore, the lack of large labelled SAR datasets hinders the development of supervised deep learning‑based object detection models. In this paper, we introduce TRANSAR, a novel self‑supervised end‑to‑end vision transformer‑based SAR object detection model that incorporates masked image pre‑training on an unlabeled SAR image dataset that spans more than 25,700 km² of ground area. Unlike traditional object detection formulations, our approach capitalises on auxiliary binary semantic segmentation, designed to segregate objects of interest, especially the smaller ones, from the background during the post‑tuning. In addition, to address the innate class imbalance due to the disproportion of the object to the image size, we introduce an adaptive sampling scheduler that dynamically adjusts the target class distribution during training based on curriculum learning and model feedback. This approach allows us to outperform conventional supervised architectures such as DeepLabv3 or UNet, and state‑of‑the‑art self‑supervised learning‑based architectures such as DPT, SegFormer or UperNet, as shown by extensive evaluations on benchmark SAR datasets.

Abstract:
Vision Foundation Models (VFMs) have delivered remarkable performance in Domain Generalized Semantic Segmentation (DGSS). However, recent methods often overlook the fact that visual cues are susceptible, whereas the underlying geometry remains stable, rendering depth information more robust. In this paper, we investigate the potential of integrating depth information with features from VFMs, to improve the geometric consistency within an image and boost the generalization performance of VFMs. We propose a novel fine‑tuning DGSS framework, named DepthForge, which integrates the visual cues from frozen DINOv2 or EVA02 and depth cues from frozen Depth Anything V2. In each layer of the VFMs, we incorporate depth‑aware learnable tokens to continuously decouple domain‑invariant visual and spatial information, thereby enhancing depth awareness and attention of the VFMs. Finally, we develop a depth refinement decoder and integrate it into the model architecture to adaptively refine multi‑layer VFM features and depth‑aware learnable tokens. Extensive experiments are conducted based on various DGSS settings and five different datasets as unseen target domains. The qualitative and quantitative results demonstrate that our method significantly outperforms alternative approaches with stronger performance, steadier visual‑spatial attention, and superior generalization ability. In particular, DepthForge exhibits outstanding performance under extreme conditions (e.g., night and snow). Code is available at https://github.com/anonymouse‑xzrptkvyqc/DepthForge.

Abstract:
Existing zero‑shot 3D point cloud segmentation methods often struggle with limited transferability from seen classes to unseen classes and from semantic to visual space. To alleviate this, we introduce 3D‑PointZshotS, a geometry‑aware zero‑shot segmentation framework that enhances both feature generation and alignment using latent geometric prototypes (LGPs). Specifically, we integrate LGPs into a generator via a cross‑attention mechanism, enriching semantic features with fine‑grained geometric details. To further enhance stability and generalization, we introduce a self‑consistency loss, which enforces feature robustness against point‑wise perturbations. Additionally, we re‑represent visual and semantic features in a shared space, bridging the semantic‑visual gap and facilitating knowledge transfer to unseen classes. Experiments on three real‑world datasets, namely ScanNet, SemanticKITTI, and S3DIS, demonstrate that our method achieves superior performance over four baselines in terms of harmonic mIoU. The code is available at https://github.com/LexieYang/3D‑PointZshotS.
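
For readers unfamiliar with the metric, harmonic mIoU in zero-shot segmentation is usually the harmonic mean of the mIoU on seen classes and the mIoU on unseen classes. The sketch below assumes that standard definition; the function name and the input values are illustrative and are not results from the paper.

```python
def harmonic_miou(seen_miou, unseen_miou):
    """Harmonic mean of seen-class and unseen-class mIoU (same units in, same out)."""
    if seen_miou + unseen_miou == 0:
        return 0.0  # avoid division by zero when both scores are zero
    return 2 * seen_miou * unseen_miou / (seen_miou + unseen_miou)


# Illustrative numbers only: strong seen-class performance cannot hide
# weak unseen-class performance under the harmonic mean.
score = harmonic_miou(60.0, 20.0)
```

Because the harmonic mean is pulled toward the smaller operand, the metric rewards methods that balance seen and unseen classes rather than excelling on one group.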

Abstract:
Given a single labeled example, in‑context segmentation aims to segment corresponding objects. This setting, known as one‑shot segmentation in few‑shot learning, explores the segmentation model's generalization ability and has been applied to various vision tasks, including scene understanding and image/video editing. While recent Segment Anything Models have achieved state‑of‑the‑art results in interactive segmentation, these approaches are not directly applicable to in‑context segmentation. In this work, we propose the Dual Consistency SAM (DC‑SAM) method based on prompt‑tuning to adapt SAM and SAM2 for in‑context segmentation of both images and videos. Our key insights are to enhance the features of the SAM's prompt encoder in segmentation by providing high‑quality visual prompts. When generating a mask prior, we fuse the SAM features to better align the prompt encoder. Then, we design a cycle‑consistent cross‑attention on fused features and initial visual prompts. Next, a dual‑branch design is provided by using the discriminative positive and negative prompts in the prompt encoder. Furthermore, we design a simple mask‑tube training strategy to adopt our proposed dual consistency method into the mask tube. Although the proposed DC‑SAM is primarily designed for images, it can be seamlessly extended to the video domain with the support of SAM2. Given the absence of in‑context segmentation in the video domain, we manually curate and construct the first benchmark from existing video segmentation datasets, named In‑Context Video Object Segmentation (IC‑VOS), to better assess the in‑context capability of the model. Extensive experiments demonstrate that our method achieves 55.5 (+1.4) mIoU on COCO‑20i, 73.0 (+1.1) mIoU on PASCAL‑5i, and a J&F score of 71.52 on the proposed IC‑VOS benchmark. Our source code and benchmark are available at https://github.com/zaplm/DC‑SAM.

Abstract:
We study the hard problem of 3D object segmentation in complex point clouds without requiring human labels of 3D scenes for supervision. By relying on the similarity of pretrained 2D features or external signals such as motion to group 3D points as objects, existing unsupervised methods are usually limited to identifying simple objects like cars or their segmented objects are often inferior due to the lack of objectness in pretrained features. In this paper, we propose a new two‑stage pipeline called GrabS. The core concept of our method is to learn generative and discriminative object‑centric priors as a foundation from object datasets in the first stage, and then design an embodied agent to learn to discover multiple objects by querying against the pretrained generative priors in the second stage. We extensively evaluate our method on two real‑world datasets and a newly created synthetic dataset, demonstrating remarkable segmentation performance, clearly surpassing all existing unsupervised methods.

Authors: Henghui Ding, Chang Liu, Nikhila Ravi, Shuting He, Yunchao Wei, Song Bai, Philip Torr, Kehuan Song, Xinglin Xie, Kexin Zhang, Licheng Jiao, Lingling Li, Shuyuan Yang, Xuqiang Cao, Linnan Zhao, Jiaxuan Zhao, Fang Liu, Mengjiao Wang, Junpei Zhang, Xu Liu, Yuting Yang, Mengru Ma, Hao Fang, Runmin Cong, Xiankai Lu, Zhiyang Chen, Wei Zhang, Tianming Liang, Haichao Jiang, Wei-Shi Zheng, Jian-Fang Hu, Haobo Yuan, Xiangtai Li, Tao Zhang, Lu Qi, Ming-Hsuan Yang

Abstract:
This report provides a comprehensive overview of the 4th Pixel‑level Video Understanding in the Wild (PVUW) Challenge, held in conjunction with CVPR 2025. It summarizes the challenge outcomes, participating methodologies, and future research directions. The challenge features two tracks: MOSE, which focuses on complex scene video object segmentation, and MeViS, which targets motion‑guided, language‑based video segmentation. Both tracks introduce new, more challenging datasets designed to better reflect real‑world scenarios. Through detailed evaluation and analysis, the challenge offers valuable insights into the current state‑of‑the‑art and emerging trends in complex video segmentation. More information can be found on the workshop website: https://pvuw.github.io/.

Abstract:
Accurate medical image segmentation is essential for effective diagnosis and treatment. Previously, PraNet‑V1 was proposed to enhance polyp segmentation by introducing a reverse attention (RA) module that utilizes background information. However, PraNet‑V1 struggles with multi‑class segmentation tasks. To address this limitation, we propose PraNet‑V2, which, compared to PraNet‑V1, effectively performs a broader range of tasks including multi‑class segmentation. At the core of PraNet‑V2 is the Dual‑Supervised Reverse Attention (DSRA) module, which incorporates explicit background supervision, independent background modeling, and semantically enriched attention fusion. Our PraNet‑V2 framework demonstrates strong performance on four polyp segmentation datasets. Additionally, by integrating DSRA to iteratively enhance foreground segmentation results in three state‑of‑the‑art semantic segmentation models, we achieve up to a 1.36% improvement in mean Dice score. Code is available at: https://github.com/ai4colonoscopy/PraNet‑V2/tree/main/binary_seg/jittor.

Abstract:
In this paper, we challenge the conventional practice in Open‑Vocabulary Semantic Segmentation (OVSS) of using averaged class‑wise text embeddings, which are typically obtained by encoding each class name with multiple templates (e.g., a photo of <class>, a sketch of a <class>). We investigate the impact of templates for OVSS, and find that for each class, there exist single‑template classifiers, which we refer to as class‑experts, that significantly outperform the conventional averaged classifier. First, to identify these class‑experts, we introduce a novel approach that estimates them without any labeled data or training. By leveraging the class‑wise prediction entropy of single‑template classifiers, we select those yielding the lowest entropy as the most reliable class‑experts. Second, we combine the outputs of class‑experts in a new fusion process. Our plug‑and‑play method, coined FLOSS, is orthogonal and complementary to existing OVSS methods, offering an improvement without the need for additional labels or training. Extensive experiments show that FLOSS consistently enhances state‑of‑the‑art OVSS models, generalizes well across datasets with different distribution shifts, and delivers substantial improvements in low‑data scenarios where only a few unlabeled images are available. Our code is available at https://github.com/yasserben/FLOSS .
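
The entropy-based selection step can be illustrated with a small sketch. It assumes that "class-wise prediction entropy" means the mean entropy over the pixels a given single-template classifier assigns to a class; that reading, the data layout, and the function names are assumptions and may differ from the FLOSS implementation.

```python
import math


def mean_entropy(prob_rows):
    """Average Shannon entropy (in nats) over a list of probability vectors."""
    total = 0.0
    for probs in prob_rows:
        total -= sum(p * math.log(p) for p in probs if p > 0)
    return total / len(prob_rows)


def select_class_experts(preds_by_template, num_classes):
    """Pick, for each class, the template whose predictions have the lowest entropy.

    preds_by_template -- {template string: [per-pixel probability vector, ...]}
    Returns {class index: template string, or None if no template predicts it}.
    """
    experts = {}
    for k in range(num_classes):
        best_template, best_entropy = None, float("inf")
        for template, prob_rows in preds_by_template.items():
            # Keep only pixels this template assigns to class k (argmax).
            rows_k = [p for p in prob_rows if p.index(max(p)) == k]
            if not rows_k:
                continue
            h = mean_entropy(rows_k)
            if h < best_entropy:
                best_template, best_entropy = template, h
        experts[k] = best_template
    return experts


# Toy unlabeled predictions for two templates, two classes, two pixels each.
preds = {
    "a photo of a {}": [[0.9, 0.1], [0.4, 0.6]],
    "a sketch of a {}": [[0.6, 0.4], [0.05, 0.95]],
}
experts = select_class_experts(preds, num_classes=2)
```

No labels are used anywhere in the selection, which matches the abstract's claim that class-experts are estimated without labeled data or training.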

Abstract:
This paper introduces SAIL, a single transformer unified multimodal large language model (MLLM) that integrates raw pixel encoding and language decoding within a singular architecture. Unlike existing modular MLLMs, which rely on a pre‑trained vision transformer (ViT), SAIL eliminates the need for a separate vision encoder, presenting a more minimalist architecture design. Instead of introducing novel architectural components, SAIL adapts mix‑attention mechanisms and multimodal positional encodings to better align with the distinct characteristics of visual and textual modalities. We systematically compare SAIL's properties (including scalability, cross‑modal information flow patterns, and visual representation capabilities) with those of modular MLLMs. By scaling both training data and model size, SAIL achieves performance comparable to modular MLLMs. Notably, the removal of pretrained ViT components enhances SAIL's scalability and results in significantly different cross‑modal information flow patterns. Moreover, SAIL demonstrates strong visual representation capabilities, achieving results on par with ViT‑22B in vision tasks such as semantic segmentation. Code and models are available at https://github.com/bytedance/SAIL.

Abstract:
Recent research has begun exploring novel view synthesis (NVS) for LiDAR point clouds, aiming to generate realistic LiDAR scans from unseen viewpoints. However, most existing approaches do not reconstruct semantic labels, which are crucial for many downstream applications such as autonomous driving and robotic perception. Unlike images, which benefit from powerful segmentation models, LiDAR point clouds lack such large‑scale pre‑trained models, making semantic annotation time‑consuming and labor‑intensive. To address this challenge, we propose SN‑LiDAR, a method that jointly performs accurate semantic segmentation, high‑quality geometric reconstruction, and realistic LiDAR synthesis. Specifically, we employ a coarse‑to‑fine planar‑grid feature representation to extract global features from multi‑frame point clouds and leverage a CNN‑based encoder to extract local semantic features from the current frame point cloud. Extensive experiments on SemanticKITTI and KITTI‑360 demonstrate the superiority of SN‑LiDAR in both semantic and geometric reconstruction, effectively handling dynamic objects and large‑scale scenes. Codes will be available on https://github.com/dtc111111/SN‑Lidar.

Abstract:
Effective scene representation is critical for the visual grounding ability of representations, yet existing methods for 3D Visual Grounding are often constrained. They either only focus on geometric and visual cues, or, like traditional 3D scene graphs, lack the multi‑dimensional attributes needed for complex reasoning. To bridge this gap, we introduce the Diverse Semantic Map (DSM) framework, a novel scene representation framework that enriches robust geometric models with a spectrum of VLM‑derived semantics, including appearance, physical properties, and affordances. The DSM is first constructed online by fusing multi‑view observations within a temporal sliding window, creating a persistent and comprehensive world model. Building on this foundation, we propose DSM‑Grounding, a new paradigm that shifts grounding from free‑form VLM queries to a structured reasoning process over the semantic‑rich map, markedly improving accuracy and interpretability. Extensive evaluations validate our approach's superiority. On the ScanRefer benchmark, DSM‑Grounding achieves a state‑of‑the‑art 59.06% overall accuracy of IoU@0.5, surpassing others by 10%. In semantic segmentation, our DSM attains a 67.93% F‑mIoU, outperforming all baselines, including privileged ones. Furthermore, successful deployment on physical robots for complex navigation and grasping tasks confirms the framework's practical utility in real‑world scenarios.

Abstract:
This paper proposes a novel framework utilizing multi‑modal large language models (MLLMs) for referring video object segmentation (RefVOS). Previous MLLM‑based methods commonly struggle with the dilemma between "Ref" and "VOS": they either specialize in understanding a few key frames (global reasoning) or tracking objects on continuous frames (local reasoning), and rely on external VOS or frame selectors to mitigate the other end of the challenge. However, our framework GLUS shows that global and local consistency can be unified into a single video segmentation MLLM: a set of sparse "context frames" provides global information, while a stream of continuous "query frames" conducts local object tracking. This is further supported by jointly training the MLLM with a pre‑trained VOS memory bank to simultaneously digest short‑range and long‑range temporal information. To improve the information efficiency within the limited context window of MLLMs, we introduce object contrastive learning to distinguish hard false‑positive objects and a self‑refined framework to identify crucial frames and perform propagation. By collectively integrating these insights, our GLUS delivers a simple yet effective baseline, achieving new state‑of‑the‑art for MLLMs on the MeViS and Ref‑Youtube‑VOS benchmark. Our project page is at https://glus‑video.github.io/.

Abstract:
Recent advancements in multimodal models have significantly improved vision‑language (VL) alignment in radiology. However, existing approaches struggle to effectively utilize complex radiology reports for learning and offer limited interpretability through attention probability visualizations. To address these challenges, we introduce RadZero, a novel framework for VL alignment in chest X‑ray with zero‑shot multi‑task capability. A key component of our approach is VL‑CABS (Vision‑Language Cross‑Attention Based on Similarity), which aligns text embeddings with local image features for interpretable, fine‑grained VL reasoning. RadZero leverages large language models to extract concise semantic sentences from radiology reports and employs multi‑positive contrastive training to effectively capture relationships between images and multiple relevant textual descriptions. It uses a pre‑trained vision encoder with additional trainable Transformer layers, allowing efficient high‑resolution image processing. By computing similarity between text embeddings and local image patch features, VL‑CABS enables zero‑shot inference with similarity probability for classification, and pixel‑level VL similarity maps for grounding and segmentation. Experimental results on public chest radiograph benchmarks show that RadZero outperforms state‑of‑the‑art methods in zero‑shot classification, grounding, and segmentation. Furthermore, VL similarity map analysis highlights the potential of VL‑CABS for improving explainability in VL alignment. Additionally, qualitative evaluation demonstrates RadZero's capability for open‑vocabulary semantic segmentation, further validating its effectiveness in medical imaging. Code is available at https://github.com/deepnoid‑ai/RadZero.

Abstract:
Moving object segmentation plays a vital role in understanding dynamic visual environments. While existing methods rely on multi‑frame image sequences to identify moving objects, single‑image MOS is critical for applications like motion intention prediction and handling camera frame drops. However, segmenting moving objects from a single image remains challenging for existing methods due to the absence of temporal cues. To address this gap, we propose MovSAM, the first framework for single‑image moving object segmentation. MovSAM leverages a Multimodal Large Language Model (MLLM) enhanced with Chain‑of‑Thought (CoT) prompting to search for the moving object and generate text prompts based on deep thinking for segmentation. These prompts are cross‑fused with visual features from the Segment Anything Model (SAM) and a Vision‑Language Model (VLM), enabling logic‑driven moving object segmentation. The segmentation results then undergo a deep thinking refinement loop, allowing MovSAM to iteratively improve its understanding of the scene context and inter‑object relationships with logical reasoning. This innovative approach enables MovSAM to segment moving objects in single images by considering scene understanding. We implement MovSAM in the real world to validate its practical application and effectiveness for autonomous driving scenarios where the multi‑frame methods fail. Furthermore, despite the inherent advantage of multi‑frame methods in utilizing temporal information, MovSAM achieves state‑of‑the‑art performance across public MOS benchmarks, reaching 92.5% on J&F. Our implementation will be available at https://github.com/IRMVLab/MovSAM.

Abstract:
Domain Adaptation (DA) and Semi‑supervised Learning (SSL) converge in Semi‑supervised Domain Adaptation (SSDA), where the objective is to transfer knowledge from a source domain to a target domain using a combination of limited labeled target samples and abundant unlabeled target data. Although intuitive, a simple amalgamation of DA and SSL is suboptimal in semantic segmentation for two major reasons: (1) previous methods, while able to learn good segmentation boundaries, are prone to confusing classes with similar visual appearance due to limited supervision; and (2) the skewed and imbalanced training data distribution favors source representation learning while impeding exploration of the limited information about tail classes. Language guidance can serve as a pivotal semantic bridge, facilitating robust class discrimination and mitigating visual ambiguities by leveraging the rich semantic relationships encoded in pre‑trained language models to enhance feature representations across domains. Therefore, we propose the first language‑guided SSDA setting for semantic segmentation in this work. Specifically, we harness the semantic generalization capabilities inherent in vision‑language models (VLMs) to establish a synergistic framework within the SSDA paradigm. To address the inherent class‑imbalance challenges in long‑tailed distributions, we introduce class‑balanced segmentation loss formulations that effectively regularize the learning process. Through extensive experimentation across diverse domain adaptation scenarios, our approach demonstrates substantial performance improvements over contemporary state‑of‑the‑art (SoTA) methodologies. Code is available at https://github.com/hritam‑98/SemiDAViL.

Abstract:
Parameter‑Efficient Fine‑Tuning (PEFT) is a technique that allows us to adapt powerful Foundation Models (FMs) to diverse downstream tasks while preserving and unleashing their inherent capabilities. However, we have observed that existing PEFT methods, which are often designed with natural imagery in mind, struggle when applied to Remote Sensing (RS) scenarios. This is primarily due to their inability to handle artifact influences, a problem particularly severe in RS image features. To tackle this challenge, we introduce Earth‑Adapter, the first PEFT method specifically designed to overcome RS artifacts. Earth‑Adapter introduces a novel Mixture of Frequency Adaptation process that combines a Mixture of Adapter (MoA) with Discrete Fourier Transformation (DFT). By utilizing DFT, Earth‑Adapter can decompose features into different frequency components, precisely separating artifacts from original features. The MoA then dynamically assigns weights to each adapter expert, allowing for the combination of features across various frequency domains. These simple‑yet‑effective approaches enable Earth‑Adapter to overcome the disturbances caused by artifacts more efficiently than previous PEFT methods, significantly enhancing the FMs' performance on RS scenarios. Experiments on Domain Adaptation (DA) and Domain Generalization (DG) semantic segmentation benchmarks showcase Earth‑Adapter's effectiveness. Compared with the baseline Rein, Earth‑Adapter improves mIoU by 9.0% in DA and 3.1% in DG benchmarks. Our code will be released at https://github.com/VisionXLab/Earth‑Adapter.
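
To make the DFT decomposition idea concrete, the sketch below splits a 1-D signal into low- and high-frequency parts with a hard cutoff and checks that the two parts sum back to the input. It is a toy illustration under stated assumptions: the function names and the cutoff rule are hypothetical, and the actual method operates on feature maps with learned adapters rather than a fixed filter.

```python
import cmath


def dft(x):
    """Naive O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT, returning the real part of each reconstructed sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def split_frequencies(signal, cutoff):
    """Split a signal into low- and high-frequency components (hypothetical helper).

    Bins whose frequency index min(k, n - k) is at most `cutoff` go to the
    low-frequency part; all remaining bins go to the high-frequency part.
    """
    n = len(signal)
    spectrum = dft(signal)
    low = [c if min(k, n - k) <= cutoff else 0 for k, c in enumerate(spectrum)]
    high = [c - l for c, l in zip(spectrum, low)]
    return idft(low), idft(high)


signal = [1, 2, 3, 4, 3, 2, 1, 0]
low_part, high_part = split_frequencies(signal, cutoff=1)
```

Because the split is a partition of the spectrum, `low_part[i] + high_part[i]` reproduces `signal[i]` up to floating-point error, so each band can be processed separately and recombined.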

Abstract:
High‑resolution segmentation is critical for precise disease diagnosis by extracting fine‑grained morphological details. Existing hierarchical encoder‑decoder frameworks have demonstrated remarkable adaptability across diverse medical segmentation tasks. While beneficial, they usually incur huge computation and memory costs when handling large‑size segmentation, which limits their applications in foundation model building and real‑world clinical scenarios. To address this limitation, we propose a holistically efficient framework for high‑resolution medical image segmentation, called HER‑Seg. Specifically, we first devise a computation‑efficient image encoder (CE‑Encoder) to model long‑range dependencies with linear complexity while maintaining sufficient representations. In particular, we introduce the dual‑gated linear attention (DLA) mechanism to perform cascaded token filtering, selectively retaining important tokens while ignoring irrelevant ones to enhance attention computation efficiency. Then, we introduce a memory‑efficient mask decoder (ME‑Decoder) to eliminate the demand for the hierarchical structure by leveraging cross‑scale segmentation decoding. Extensive experiments reveal that HER‑Seg outperforms state‑of‑the‑art methods in high‑resolution medical 2D, 3D and video segmentation tasks. In particular, our HER‑Seg requires only 0.59GB training GPU memory and 9.39G inference FLOPs per 1024×1024 image, demonstrating superior memory and computation efficiency. The code is available at https://github.com/xq141839/HER‑Seg.

Abstract:
We present CAT‑V (Caption AnyThing in Video), a training‑free framework for fine‑grained object‑centric video captioning that enables detailed descriptions of user‑selected objects through time. CAT‑V integrates three key components: a Segmenter based on SAMURAI for precise object segmentation across frames, a Temporal Analyzer powered by TRACE‑Uni for accurate event boundary detection and temporal analysis, and a Captioner using InternVL‑2.5 for generating detailed object‑centric descriptions. Through spatiotemporal visual prompts and chain‑of‑thought reasoning, our framework generates detailed, temporally‑aware descriptions of objects' attributes, actions, statuses, interactions, and environmental contexts without requiring additional training data. CAT‑V supports flexible user interactions through various visual prompts (points, bounding boxes, and irregular regions) and maintains temporal sensitivity by tracking object states and interactions across different time segments. Our approach addresses limitations of existing video captioning methods, which either produce overly abstract descriptions or lack object‑level precision, enabling fine‑grained, object‑specific descriptions while maintaining temporal coherence and spatial accuracy. The GitHub repository for this project is available at https://github.com/yunlong10/CAT‑V

Abstract:
Recent advances in scene understanding benefit a lot from depth maps because of the 3D geometry information, especially in complex conditions (e.g., low light and overexposed). Existing approaches encode depth maps along with RGB images and perform feature fusion between them to enable more robust predictions. Taking into account that depth can be regarded as a geometry supplement for RGB images, a straightforward question arises: Do we really need to explicitly encode depth information with neural networks as done for RGB images? Based on this insight, in this paper, we investigate a new way to learn RGBD feature representations and present DFormerv2, a strong RGBD encoder that explicitly uses depth maps as geometry priors rather than encoding depth information with neural networks. Our goal is to extract the geometry clues from the depth and spatial distances among all the image patch tokens, which will then be used as geometry priors to allocate attention weights in self‑attention. Extensive experiments demonstrate that DFormerv2 exhibits exceptional performance in various RGBD semantic segmentation benchmarks. Code is available at: https://github.com/VCIP‑RGBD/DFormer.

Abstract:
Effective Class Incremental Segmentation (CIS) requires simultaneously mitigating catastrophic forgetting and ensuring sufficient plasticity to integrate new classes. The inherent conflict above often leads to a back‑and‑forth, which turns the objective into finding the balance between the performance of previous (old) and incremental (new) classes. To address this conflict, we introduce a novel approach, Conflict Mitigation via Branched Optimization (CoMBO). Within this approach, we present the Query Conflict Reduction module, designed to explicitly refine queries for new classes through lightweight, class‑specific adapters. This module provides an additional branch for the acquisition of new classes while preserving the original queries for distillation. Moreover, we develop two strategies to further mitigate the conflict following the branched structure, i.e., the Half‑Learning Half‑Distillation (HDHL) over classification probabilities, and the Importance‑Based Knowledge Distillation (IKD) over query features. HDHL selectively engages in learning for classification probabilities of queries that match the ground truth of new classes, while aligning unmatched ones to the corresponding old probabilities, thus ensuring retention of old knowledge while absorbing new classes via learning negative samples. Meanwhile, IKD assesses the importance of queries based on their matching degree to old classes, prioritizing the distillation of important features and allowing less critical features to evolve. Extensive experiments in Class Incremental Panoptic and Semantic Segmentation settings have demonstrated the superior performance of CoMBO. Project page: https://guangyu‑ryan.github.io/CoMBO.

Abstract:
Document image segmentation is crucial for document analysis and recognition but remains challenging due to the diversity of document formats and segmentation tasks. Existing methods often address these tasks separately, resulting in limited generalization and resource wastage. This paper introduces DocSAM, a transformer‑based unified framework designed for various document image segmentation tasks, such as document layout analysis, multi‑granularity text segmentation, and table structure recognition, by modelling these tasks as a combination of instance and semantic segmentation. Specifically, DocSAM employs Sentence‑BERT to map category names from each dataset into semantic queries that match the dimensionality of instance queries. These two sets of queries interact through an attention mechanism and are cross‑attended with image features to predict instance and semantic segmentation masks. Instance categories are predicted by computing the dot product between instance and semantic queries, followed by softmax normalization of scores. Consequently, DocSAM can be jointly trained on heterogeneous datasets, enhancing robustness and generalization while reducing computational and storage resources. Comprehensive evaluations show that DocSAM surpasses existing methods in accuracy, efficiency, and adaptability, highlighting its potential for advancing document image understanding and segmentation across various applications. Codes are available at https://github.com/xhli‑git/DocSAM.
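
The classification step described above (dot product between instance and semantic queries, then softmax) can be sketched in a few lines. This is a minimal illustration assuming plain lists of floats as query vectors; the function names and toy values are hypothetical and are not taken from the DocSAM code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_instances(instance_queries, semantic_queries):
    """Return per-instance class probabilities.

    Each instance query is scored against every semantic (category) query
    by dot product; the scores are then normalized with softmax.
    """
    results = []
    for inst in instance_queries:
        scores = [sum(a * b for a, b in zip(inst, sem)) for sem in semantic_queries]
        results.append(softmax(scores))
    return results


# Toy example: one instance query, two category queries of the same dimension.
probs = classify_instances([[1.0, 0.0]], [[2.0, 0.0], [0.0, 2.0]])
```

Since the category set enters only through the semantic queries, swapping in the category names of a different dataset changes the classifier without changing any weights, which is what makes joint training on heterogeneous datasets possible.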

Abstract:
Curvilinear structure segmentation (CSS) is essential in various domains, including medical imaging, landscape analysis, industrial surface inspection, and plant analysis. While existing methods achieve high performance within specific domains, their generalizability is limited. On the other hand, large‑scale models such as Segment Anything Model (SAM) exhibit strong generalization but are not optimized for curvilinear structures. Existing adaptations of SAM primarily focus on general object segmentation and lack specialized design for CSS tasks. To bridge this gap, we propose the Universal Curvilinear structure Segmentation (UCS) model, which adapts SAM to CSS tasks while further enhancing its cross‑domain generalization. UCS features a novel encoder architecture integrating a pretrained SAM encoder with two innovations: a Sparse Adapter, strategically inserted to inherit the pre‑trained SAM encoder's generalization capability while minimizing the number of fine‑tuning parameters, and a Prompt Generation module, which leverages Fast Fourier Transform with a high‑pass filter to generate curve‑specific prompts. Furthermore, the UCS incorporates a mask decoder that eliminates reliance on manual interaction through a dual‑compression module: a Hierarchical Feature Compression module, which aggregates the outputs of the sampled encoder to enhance detail preservation, and a Guidance Feature Compression module, which extracts and compresses image‑driven guidance features. Evaluated on a comprehensive multi‑domain dataset, including an in‑house dataset covering eight natural curvilinear structures, UCS demonstrates state‑of‑the‑art generalization and open‑set segmentation performance across medical, engineering, natural, and plant imagery, establishing a new benchmark for universal CSS. The source code is available at https://github.com/kylechuuuuu/UCS.

Abstract:
Vision Foundation Models (VFMs) and Vision‑Language Models (VLMs) have gained traction in Domain Generalized Semantic Segmentation (DGSS) due to their strong generalization capabilities. However, existing DGSS methods often rely exclusively on either VFMs or VLMs, overlooking their complementary strengths. VFMs (e.g., DINOv2) excel at capturing fine‑grained features, while VLMs (e.g., CLIP) provide robust text alignment but struggle with coarse granularity. Despite their complementary strengths, effectively integrating VFMs and VLMs with attention mechanisms is challenging, as the increased patch tokens complicate long‑sequence modeling. To address this, we propose MFuser, a novel Mamba‑based fusion framework that efficiently combines the strengths of VFMs and VLMs while maintaining linear scalability in sequence length. MFuser consists of two key components: MVFuser, which acts as a co‑adapter to jointly fine‑tune the two models by capturing both sequential and spatial dynamics; and MTEnhancer, a hybrid attention‑Mamba module that refines text embeddings by incorporating image priors. Our approach achieves precise feature locality and strong text alignment without incurring significant computational overhead. Extensive experiments demonstrate that MFuser significantly outperforms state‑of‑the‑art DGSS methods, achieving 68.20 mIoU on synthetic‑to‑real and 71.87 mIoU on real‑to‑real benchmarks. The code is available at https://github.com/devinxzhang/MFuser.

Abstract:
Semantic segmentation of high‑resolution remote sensing images plays a crucial role in land‑use monitoring and urban planning. Recent remarkable progress in deep learning‑based methods makes it possible to generate satisfactory segmentation results. However, existing methods still face challenges in adapting network parameters to various land cover distributions and enhancing the interaction between spatial and frequency domain features. To address these challenges, we propose the Adaptive Frequency Enhancement Network (AFENet), which integrates two key components: the Adaptive Frequency and Spatial feature Interaction Module (AFSIM) and the Selective feature Fusion Module (SFM). AFSIM dynamically separates and modulates high‑ and low‑frequency features according to the content of the input image. It adaptively generates two masks to separate high‑ and low‑frequency components, therefore providing optimal details and contextual supplementary information for ground object feature representation. SFM selectively fuses global context and local detailed features to enhance the network's representation capability. Hence, the interactions between frequency and spatial features are further enhanced. Extensive experiments on three publicly available datasets demonstrate that the proposed AFENet outperforms state‑of‑the‑art methods. In addition, we also validate the effectiveness of AFSIM and SFM in managing diverse land cover types and complex scenarios. Our codes are available at https://github.com/oucailab/AFENet.

Abstract:
Rip currents are the leading cause of fatal accidents and injuries on many beaches worldwide, emphasizing the importance of automatically detecting these hazardous surface water currents. In this paper, we address a novel task: rip current instance segmentation. We introduce a comprehensive dataset containing 2,466 images with newly created polygonal annotations for instance segmentation, used for training and validation. Additionally, we present a novel dataset comprising 17 drone videos (comprising about 24K frames) captured at 30 FPS, annotated with both polygons for instance segmentation and bounding boxes for object detection, employed for testing purposes. We train various versions of YOLOv8 for instance segmentation on static images and assess their performance on the test dataset (videos). The best results were achieved by the YOLOv8‑nano model (runnable on a portable device), with an mAP50 of 88.94% on the validation dataset and 81.21% macro average on the test dataset. The results provide a baseline for future research in rip current segmentation. Our work contributes to the existing literature by introducing a detailed, annotated dataset, and training a deep learning model for instance segmentation of rip currents. The code, training details and the annotated dataset are made publicly available at https://github.com/Irikos/rip_currents.

Abstract:
The accurate delineation of agricultural field boundaries from satellite imagery is vital for land management and crop monitoring. However, current methods face challenges due to limited dataset sizes, resolution discrepancies, and diverse environmental conditions. We address this by reformulating the task as instance segmentation and introducing the Field Boundary Instance Segmentation ‑ 22M dataset (FBIS‑22M), a large‑scale, multi‑resolution dataset comprising 672,909 high‑resolution satellite image patches (ranging from 0.25 m to 10 m) and 22,926,427 instance masks of individual fields, significantly narrowing the gap between agricultural datasets and those in other computer vision domains. We further propose Delineate Anything, an instance segmentation model trained on our new FBIS‑22M dataset. Our proposed model sets a new state‑of‑the‑art, achieving a substantial improvement of 88.5% in mAP@0.5 and 103% in mAP@0.5:0.95 over existing methods, while also demonstrating significantly faster inference and strong zero‑shot generalization across diverse image resolutions and unseen geographic regions. Code, pre‑trained models, and the FBIS‑22M dataset are available at https://lavreniuk.github.io/Delineate‑Anything.

Abstract:
Few‑shot point cloud semantic segmentation aims to accurately segment "unseen" new categories in point cloud scenes using limited labeled data. However, pretraining‑based methods not only introduce excessive time overhead but also overlook the local structure representation among irregular point clouds. To address these issues, we propose a pretraining‑free local structure fitting network for few‑shot point cloud semantic segmentation, named TaylorSeg. Specifically, inspired by Taylor series, we treat the local structure representation of irregular point clouds as a polynomial fitting problem and propose a novel local structure fitting convolution, called TaylorConv. This convolution learns the low‑order basic information and high‑order refined information of point clouds from explicit encoding of local geometric structures. Then, using TaylorConv as the basic component, we construct two variants of TaylorSeg: a non‑parametric TaylorSeg‑NN and a parametric TaylorSeg‑PN. The former can achieve performance comparable to existing parametric models without pretraining. For the latter, we equip it with an Adaptive Push‑Pull (APP) module to mitigate the feature distribution differences between the query set and the support set. Extensive experiments validate the effectiveness of the proposed method. Notably, under the 2‑way 1‑shot setting, TaylorSeg‑PN achieves improvements of +2.28% and +4.37% mIoU on the S3DIS and ScanNet datasets respectively, compared to the previous state‑of‑the‑art methods. Our code is available at https://github.com/changshuowang/TaylorSeg.

Abstract:
Supervised deep learning for semantic segmentation has achieved excellent results in accurately identifying anatomical and pathological structures in medical images. However, it often requires large annotated training datasets, which limits its scalability in clinical settings. To address this challenge, semi‑supervised learning is a well‑established approach that leverages both labeled and unlabeled data. In this paper, we introduce a novel semi‑supervised teacher‑student framework for biomedical image segmentation, inspired by the recent success of generative models. Our approach leverages denoising diffusion probabilistic models (DDPMs) to generate segmentation masks by progressively refining noisy inputs conditioned on the corresponding images. The teacher model is first trained in an unsupervised manner using a cycle‑consistency constraint based on noise‑corrupted image reconstruction, enabling it to generate informative semantic masks. Subsequently, the teacher is integrated into a co‑training process with a twin‑student network. The student learns from ground‑truth labels when available and from teacher‑generated pseudo‑labels otherwise, while the teacher continuously improves its pseudo‑labeling capabilities. Finally, to further enhance performance, we introduce a multi‑round pseudo‑label generation strategy that iteratively improves the pseudo‑labeling process. We evaluate our approach on multiple biomedical imaging benchmarks, spanning multiple imaging modalities and segmentation tasks. Experimental results show that our method consistently outperforms state‑of‑the‑art semi‑supervised techniques, highlighting its effectiveness in scenarios with limited annotated data. The code to replicate our experiments can be found at https://github.com/ciampluca/diffusion_semi_supervised_biomedical_image_segmentation

Abstract:
In this paper, we address the challenging problem of open‑world instance segmentation. Existing works have shown that vanilla visual networks are biased toward learning appearance information, e.g., texture, to recognize objects. This implicit bias causes the model to fail in detecting novel objects with unseen textures in the open‑world setting. To address this challenge, we propose a learning framework, called view‑Consistent LeaRning (v‑CLR), which aims to enforce the model to learn appearance‑invariant representations for robust instance segmentation. In v‑CLR, we first introduce additional views for each image, where the texture undergoes significant alterations while preserving the image's underlying structure. We then encourage the model to learn the appearance‑invariant representation by enforcing the consistency between object features across different views, for which we obtain class‑agnostic object proposals using off‑the‑shelf unsupervised models that possess strong object‑awareness. These proposals enable cross‑view object feature matching, greatly reducing the appearance dependency while enhancing the object‑awareness. We thoroughly evaluate our method on public benchmarks under both cross‑class and cross‑dataset settings, achieving state‑of‑the‑art performance. Project page: https://visual‑ai.github.io/vclr

Abstract:
This paper addresses the challenge of capturing global temporal dependencies in long video sequences for Video Object Segmentation (VOS). Existing architectures often fail to effectively model these dependencies across extended temporal horizons. To overcome this limitation, we introduce GISE‑TTT, a novel architecture that integrates Temporal Transformer (TTT) layers into transformer‑based frameworks through a co‑designed hierarchical approach. The TTT layer systematically condenses historical temporal information into hidden states that encode globally coherent contextual representations. By leveraging multi‑stage contextual aggregation through hierarchical concatenation, our framework progressively refines spatiotemporal dependencies across network layers. This design represents the first systematic empirical evidence that distributing global information across multiple network layers is critical for optimal dependency utilization in video segmentation tasks. Ablation studies demonstrate that incorporating TTT modules at high‑level feature stages significantly enhances global modeling capabilities, thereby improving the network's ability to capture long‑range temporal relationships. Extensive experiments on DAVIS 2017 show that GISE‑TTT achieves a 3.2% improvement in segmentation accuracy over the baseline model, providing comprehensive evidence that global information should be strategically leveraged throughout the network architecture. The code will be made available at: https://github.com/uuool/GISE‑TTT.

Abstract:
Cell instance segmentation is a fundamental task in digital pathology with broad clinical applications. Recently, vision foundation models, which are predominantly based on Vision Transformers (ViTs), have achieved remarkable success in pathology image analysis. However, their improvements in cell instance segmentation remain limited. A key challenge arises from the tokenization process in ViTs, which substantially reduces the spatial resolution of input images, leading to suboptimal segmentation quality, especially for small and densely packed cells. To address this problem, we propose CellVTA (Cell Vision Transformer with Adapter), a novel method that improves the performance of vision foundation models for cell instance segmentation by incorporating a CNN‑based adapter module. This adapter extracts high‑resolution spatial information from input images and injects it into the ViT through a cross‑attention mechanism. Our method preserves the core architecture of ViT, ensuring seamless integration with pretrained foundation models. Extensive experiments show that CellVTA achieves 0.538 mPQ on the CoNIC dataset and 0.506 mPQ on the PanNuke dataset, which significantly outperforms the state‑of‑the‑art cell segmentation methods. Ablation studies confirm the superiority of our approach over other fine‑tuning strategies, including decoder‑only fine‑tuning and full fine‑tuning. Our code and models are publicly available at https://github.com/JieZheng‑ShanghaiTech/CellVTA.

Abstract:
Few‑Shot Semantic Segmentation (FSS), which focuses on segmenting new classes in images using only a limited number of annotated examples, has recently progressed in data‑scarce domains. However, in this work, we show that the existing FSS methods often struggle to generalize to underwater environments. Specifically, the prior features extracted by pre‑trained models used as feature extractors are fragile due to the unique challenges of underwater images. To address this, we propose FSSUWNet, a tailored FSS framework for underwater images with feature enhancement. FSSUWNet exploits the integration of complementary features, emphasizing both low‑level and high‑level image characteristics. In addition to employing a pre‑trained model as the primary encoder, we propose an auxiliary encoder called Feature Enhanced Encoder which extracts complementary features to better adapt to underwater scene characteristics. Furthermore, a simple and effective Feature Alignment Module aims to provide global prior knowledge and align low‑level features with high‑level features in dimensions. Given the scarcity of underwater images, we introduce a cross‑validation dataset version based on the Segmentation of Underwater Imagery dataset. Extensive experiments on public underwater segmentation datasets demonstrate that our approach achieves state‑of‑the‑art performance. For example, our method outperforms the previous best method by 2.8% and 2.6% in terms of the mean Intersection over Union metric for 1‑shot and 5‑shot scenarios in the datasets, respectively. Our implementation is available at https://github.com/lizhh268/FSSUWNet.

Abstract:
Referring video object segmentation (RVOS) is a challenging task that requires the model to segment the object in a video given the language description. MeViS is a recently proposed dataset that contains motion expressions of the target objects, leading to a challenging benchmark compared with existing RVOS benchmarks. On the other hand, for referring expression tasks, a new trend is to adopt multi‑modal large language models (MLLMs) to achieve better image and text alignment. In this report, we show that with a simple modification to the test‑time inference method on stronger MLLMs, we can obtain stronger results on MeViS. In particular, we adopt the recent method Sa2VA, a unified model for dense grounded understanding of both images and videos. By enlarging the scope of key frames, without any further training, we achieve 3rd place in the 4th PVUW workshop.

Abstract:
Referring Video Object Segmentation (RVOS) aims to segment target objects throughout a video based on a text description. This task has attracted increasing attention in the field of computer vision due to its promising applications in video editing and human‑agent interaction. Recently, ReferDINO has demonstrated promising performance in this task by adapting object‑level vision‑language knowledge from pretrained foundational image models. In this report, we further enhance its capabilities by incorporating the advantages of SAM2 in mask quality and object consistency. In addition, to effectively balance performance between single‑object and multi‑object scenarios, we introduce a conditional mask fusion strategy that adaptively fuses the masks from ReferDINO and SAM2. Our solution, termed ReferDINO‑Plus, achieves 60.43 \(\mathcal{J}\&\mathcal{F}\) on MeViS test set, securing 2nd place in the MeViS PVUW challenge at CVPR 2025. The code is available at: https://github.com/iSEE‑Laboratory/ReferDINO‑Plus.

Abstract:
Underwater image understanding is crucial for both submarine navigation and seabed exploration. However, the low illumination in underwater environments degrades the imaging quality, which in turn seriously deteriorates the performance of underwater semantic segmentation, particularly for outlining the object region boundaries. To tackle this issue, we present UnderWater SegFormer (UWSegFormer), a transformer‑based framework for semantic segmentation of low‑quality underwater images. Firstly, we propose the Underwater Image Quality Attention (UIQA) module. This module enhances the representation of high‑quality semantic information in underwater image feature channels through a channel self‑attention mechanism. In order to address the issue of loss of imaging details due to the underwater environment, the Multi‑scale Aggregation Attention (MAA) module is proposed. This module aggregates sets of semantic features at different scales by extracting discriminative information from high‑level features, thus compensating for the semantic loss of detail in underwater objects. Finally, during training, we introduce Edge Learning Loss (ELL) in order to enhance the model's learning of underwater object edges and improve the model's prediction accuracy. Experiments conducted on the SUIM and DUT‑USEG (DUT) datasets have demonstrated that the proposed method has advantages in terms of segmentation completeness, boundary clarity, and subjective perceptual details when compared to SOTA methods. In addition, the proposed method achieves the highest mIoU of 82.12 and 71.41 on the SUIM and DUT datasets, respectively. Code will be available at https://github.com/SAWRJJ/UWSegFormer.

Abstract:
Moving object segmentation is a crucial task for achieving a high‑level understanding of visual scenes and has numerous downstream applications. Humans can effortlessly segment moving objects in videos. Previous work has largely relied on optical flow to provide motion cues; however, this approach often results in imperfect predictions due to challenges such as partial motion, complex deformations, motion blur and background distractions. We propose a novel approach for moving object segmentation that combines long‑range trajectory motion cues with DINO‑based semantic features and leverages SAM2 for pixel‑level mask densification through an iterative prompting strategy. Our model employs Spatio‑Temporal Trajectory Attention and Motion‑Semantic Decoupled Embedding to prioritize motion while integrating semantic support. Extensive testing on diverse datasets demonstrates state‑of‑the‑art performance, excelling in challenging scenarios and fine‑grained segmentation of multiple objects. Our code is available at https://motion‑seg.github.io/.

Abstract:
Instance segmentation is essential for augmented reality and virtual reality (AR/VR) as it enables precise object recognition and interaction, enhancing the integration of virtual and real‑world elements for an immersive experience. However, the high computational overhead of segmentation limits its application on resource‑constrained AR/VR devices, causing large processing latency and degrading user experience. In contrast to conventional scenarios, AR/VR users typically focus on only a few regions within their field of view before shifting perspective, allowing segmentation to be concentrated on gaze‑specific areas. This insight drives the need for efficient segmentation methods that prioritize processing instances of interest, reducing computational load and enhancing real‑time performance. In this paper, we present a foveated instance segmentation (FovealSeg) framework that leverages real‑time user gaze data to perform instance segmentation exclusively on instances of interest, resulting in substantial computational savings. Evaluation results show that FSNet achieves an IoU of 0.56 on ADE20K and 0.54 on LVIS, notably outperforming the baseline. The code is available at https://github.com/SAI‑

Abstract:
Open‑vocabulary semantic segmentation models associate vision and text to label pixels from an undefined set of classes using textual queries, providing versatile performance on novel datasets. However, large shifts between training and test domains degrade their performance, requiring fine‑tuning for effective real‑world applications. We introduce Semantic Library Adaptation (SemLA), a novel framework for training‑free, test‑time domain adaptation. SemLA leverages a library of LoRA‑based adapters indexed with CLIP embeddings, dynamically merging the most relevant adapters based on proximity to the target domain in the embedding space. This approach constructs an ad‑hoc model tailored to each specific input without additional training. Our method scales efficiently, enhances explainability by tracking adapter contributions, and inherently protects data privacy, making it ideal for sensitive applications. Comprehensive experiments on a 20‑domain benchmark built over 10 standard datasets demonstrate SemLA's superior adaptability and performance across diverse settings, establishing a new standard in domain adaptation for open‑vocabulary semantic segmentation.
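
The adapter-merging idea in this abstract can be sketched as follows. This is an illustrative sketch only, not SemLA's actual procedure: the choice of cosine similarity, top-k selection, a softmax with a temperature, and flat weight vectors standing in for LoRA parameters are all assumptions, as are the names `merge_adapters`, `top_k`, and `temperature`.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def merge_adapters(query_emb, library, top_k=2, temperature=0.1):
    """Merge the top_k adapters closest to query_emb.

    library is a list of (key_embedding, adapter_weights) pairs, where
    adapter_weights is a flat list standing in for LoRA parameters.
    Returns the merged weights and the coefficient used per adapter index.
    """
    # Rank library entries by similarity between query and key embedding.
    scored = sorted(
        ((cosine(query_emb, key), idx) for idx, (key, _) in enumerate(library)),
        reverse=True,
    )[:top_k]
    # Turn similarities into mixing coefficients with a softmax.
    best = max(score for score, _ in scored)
    exps = [math.exp((score - best) / temperature) for score, _ in scored]
    total = sum(exps)
    coeffs = {idx: e / total for e, (_, idx) in zip(exps, scored)}
    # Weighted sum of the selected adapters' parameters.
    merged = [0.0] * len(library[0][1])
    for idx, coeff in coeffs.items():
        for j, weight in enumerate(library[idx][1]):
            merged[j] += coeff * weight
    return merged, coeffs

# Hypothetical library of three adapters indexed by 2-D "domain" embeddings.
library = [
    ([1.0, 0.0], [1.0, 1.0]),
    ([0.0, 1.0], [-1.0, -1.0]),
    ([0.9, 0.1], [2.0, 2.0]),
]
merged, coeffs = merge_adapters([1.0, 0.0], library)
print(coeffs, merged)
```

Returning the coefficients alongside the merged weights mirrors the explainability point made in the abstract: the contribution of each adapter to a given prediction can be inspected directly.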

Abstract:
Underwater dense prediction, especially depth estimation and semantic segmentation, is crucial for gaining a comprehensive understanding of underwater scenes. Nevertheless, high‑quality and large‑scale underwater datasets with dense annotations remain scarce because of the complex environment and the exorbitant data collection costs. This paper proposes a unified Text‑to‑Image and DEnse annotation generation method (TIDE) for underwater scenes. It relies solely on text as input to simultaneously generate realistic underwater images and multiple highly consistent dense annotations. Specifically, we unify the generation of text‑to‑image and text‑to‑dense annotations within a single model. The Implicit Layout Sharing mechanism (ILS) and cross‑modal interaction method called Time Adaptive Normalization (TAN) are introduced to jointly optimize the consistency between image and dense annotations. We synthesize a large‑scale underwater dataset using TIDE to validate the effectiveness of our method in underwater dense prediction tasks. The results demonstrate that our method effectively improves the performance of existing underwater dense prediction models and mitigates the scarcity of underwater data with dense annotations. We hope our method can offer new perspectives on alleviating data scarcity issues in other fields. The code is available at https://github.com/HongkLin/TIDE

Abstract:
Semantic scene understanding is crucial for robotics and computer vision applications. In autonomous driving, 3D semantic segmentation plays an important role for enabling safe navigation. Despite significant advances in the field, the complexity of collecting and annotating 3D data is a bottleneck in these developments. To overcome that data annotation limitation, synthetic simulated data has been used to generate annotated data on demand. There is still, however, a domain gap between real and simulated data. More recently, diffusion models have been in the spotlight, enabling close‑to‑real data synthesis. Those generative models have been recently applied to the 3D data domain for generating scene‑scale data with semantic annotations. Still, those methods either rely on image projection or decoupled models trained with different resolutions in a coarse‑to‑fine manner. Such intermediary representations impact the generated data quality due to errors added in those transformations. In this work, we propose a novel approach able to generate 3D semantic scene‑scale data without relying on any projection or decoupled trained multi‑resolution models, achieving more realistic semantic scene data generation compared to previous state‑of‑the‑art methods. Besides improving 3D semantic scene‑scale data synthesis, we thoroughly evaluate the use of the synthetic scene samples as labeled data to train a semantic segmentation network. In our experiments, we show that using the synthetic annotated data generated by our method as training data, together with the real semantic segmentation labels, leads to an improvement in the semantic segmentation model performance. Our results show the potential of generated scene‑scale point clouds to generate more training data to extend existing datasets, reducing the data annotation effort. Our code is available at https://github.com/PRBonn/3DiSS.

Abstract:
Video semantic segmentation (VSS) plays a vital role in understanding the temporal evolution of scenes. Traditional methods often segment videos frame‑by‑frame or in a short temporal window, leading to limited temporal context, redundant computations, and heavy memory requirements. To this end, we introduce a Temporal Video State Space Sharing (TV3S) architecture to leverage Mamba state space models for temporal feature sharing. Our model features a selective gating mechanism that efficiently propagates relevant information across video frames, eliminating the need for a memory‑heavy feature pool. By processing spatial patches independently and incorporating shifted operation, TV3S supports highly parallel computation in both training and inference stages, which reduces the delay in sequential state space processing and improves the scalability for long video sequences. Moreover, TV3S incorporates information from prior frames during inference, achieving long‑range temporal coherence and superior adaptability to extended sequences. Evaluations on the VSPW and Cityscapes datasets reveal that our approach outperforms current state‑of‑the‑art methods, establishing a new standard for VSS with consistent results across long video sequences. By achieving a good balance between accuracy and efficiency, TV3S shows a significant advancement in spatiotemporal modeling, paving the way for efficient video analysis. The code is publicly available at https://github.com/Ashesham/TV3S.git.

Abstract:
We propose a training‑free method for open‑vocabulary semantic segmentation using Vision‑and‑Language Models (VLMs). Our approach enhances the initial per‑patch predictions of VLMs through label propagation, which jointly optimizes predictions by incorporating patch‑to‑patch relationships. Since VLMs are primarily optimized for cross‑modal alignment and not for intra‑modal similarity, we use a Vision Model (VM) that is observed to better capture these relationships. We address resolution limitations inherent to patch‑based encoders by applying label propagation at the pixel level as a refinement step, significantly improving segmentation accuracy near class boundaries. Our method, called LPOSS+, performs inference over the entire image, avoiding window‑based processing and thereby capturing contextual interactions across the full image. LPOSS+ achieves state‑of‑the‑art performance among training‑free methods, across a diverse set of datasets. Code: https://github.com/vladan‑stojnic/LPOSS
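
To make the label-propagation step concrete, here is a minimal sketch of the classic iterative scheme F <- alpha * S * F + (1 - alpha) * Y on a small patch graph. It is an assumption that this textbook formulation matches the one used by LPOSS+; the affinity matrix `W` is set by hand here, whereas the abstract derives patch relationships from a Vision Model, and `alpha` and `iters` are illustrative values.

```python
import math

def label_propagation(W, Y, alpha=0.8, iters=100):
    """Refine initial per-patch class scores Y using patch affinities W.

    W: symmetric n x n affinity matrix with zero diagonal (hand-set here).
    Y: n x c initial class scores, e.g. per-patch VLM predictions.
    """
    n, c = len(W), len(Y[0])
    degree = [sum(row) for row in W]
    # Symmetric normalization: S = D^{-1/2} W D^{-1/2}.
    S = [[W[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
         for i in range(n)]
    F = [row[:] for row in Y]
    for _ in range(iters):
        # Mix scores diffused from neighbours with the initial predictions.
        F = [[alpha * sum(S[i][k] * F[k][j] for k in range(n))
              + (1 - alpha) * Y[i][j] for j in range(c)]
             for i in range(n)]
    return F

# Four patches in a chain: 0-1 and 2-3 are strongly similar, 1-2 weakly.
W = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.1, 0.0],
    [0.0, 0.1, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
# Initial scores for two classes; patch 1 is a noisy prediction.
Y = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8], [0.1, 0.9]]

F = label_propagation(W, Y)
initial_labels = [max(range(2), key=row.__getitem__) for row in Y]
refined_labels = [max(range(2), key=row.__getitem__) for row in F]
print(initial_labels, refined_labels)
```

In this toy graph, patch 1 initially leans toward the second class, but its strong affinity to patch 0 pulls it to the first class after propagation, which is the kind of joint correction the abstract attributes to incorporating patch-to-patch relationships.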

Abstract:
3D scene understanding has been transformed by open‑vocabulary language models that enable interaction via natural language. However, at present the evaluation of these representations is limited to datasets with closed‑set semantics that do not capture the richness of language. This work presents OpenLex3D, a dedicated benchmark for evaluating 3D open‑vocabulary scene representations. OpenLex3D provides entirely new label annotations for scenes from Replica, ScanNet++, and HM3D, which capture real‑world linguistic variability by introducing synonymical object categories and additional nuanced descriptions. Our label sets provide 13 times more labels per scene than the original datasets. By introducing an open‑set 3D semantic segmentation task and an object retrieval task, we evaluate various existing 3D open‑vocabulary methods on OpenLex3D, showcasing failure cases, and avenues for improvement. Our experiments provide insights on feature precision, segmentation, and downstream capabilities. The benchmark is publicly available at: https://openlex3d.github.io/.

Abstract:
Traditional open‑access datasets focusing on surgical procedures are often limited by their small size, typically consisting of fewer than 100 videos and less than 30 hours of footage, which leads to poor model generalization. To address this data limitation, a new dataset called LEMON has been compiled using a novel aggregation pipeline that collects high‑resolution videos from online sources. Featuring an extensive collection of over 4K surgical videos totaling 938 hours (85 million frames) of high‑quality footage across multiple procedure types, LEMON offers a comprehensive resource surpassing existing alternatives in size and scope, including two novel downstream tasks. To demonstrate the effectiveness of this diverse dataset, we introduce LemonFM, a foundation model pretrained on LEMON using a novel self‑supervised augmented knowledge distillation approach. LemonFM consistently outperforms existing surgical foundation models across four downstream tasks and six datasets, achieving significant gains in surgical phase recognition (+9.5pp, +9.4pp, and +8.4pp in Jaccard on AutoLaparo, M2CAI16, and Cholec80), surgical action recognition (+4.4pp in mAP on CholecT50), surgical tool presence detection (+5.3pp and +10.2pp in mAP on Cholec80 and GraSP), and surgical semantic segmentation (+10.3pp in mDice on CholecSeg8k). LEMON and LemonFM will serve as foundational resources for the research community and industry, accelerating progress in developing autonomous robotic surgery systems and ultimately contributing to safer and more accessible surgical care worldwide. Dataset, code, and models are publicly available at https://github.com/visurg‑ai/LEMON.

Abstract:
Video camouflaged object segmentation (VCOS), aiming at segmenting camouflaged objects that seamlessly blend into their environment, is a fundamental vision task with various real‑world applications. With the release of SAM2, video segmentation has witnessed significant progress. However, SAM2's capability of segmenting camouflaged videos is suboptimal, especially when given simple prompts such as point and box. To address the problem, we propose Camouflaged SAM2 (CamSAM2), which enhances SAM2's ability to handle camouflaged scenes without modifying SAM2's parameters. Specifically, we introduce a decamouflaged token to provide the flexibility of feature adjustment for VCOS. To make full use of fine‑grained and high‑resolution features from the current frame and previous frames, we propose implicit object‑aware fusion (IOF) and explicit object‑aware fusion (EOF) modules, respectively. Object prototype generation (OPG) is introduced to abstract and memorize object prototypes with informative details using high‑quality features from previous frames. Extensive experiments are conducted to validate the effectiveness of our approach. While CamSAM2 only adds negligible learnable parameters to SAM2, it substantially outperforms SAM2 on three VCOS datasets, especially achieving 12.2 mDice gains with click prompt on MoCA‑Mask and 19.6 mDice gains with mask prompt on SUN‑SEG‑Hard, with Hiera‑T as the backbone. The code is available at https://github.com/zhoustan/CamSAM2.

Abstract:
Accurate object segmentation is crucial for high‑quality scene understanding in the 3D vision domain. However, 3D segmentation based on 3D Gaussian Splatting (3DGS) struggles with accurately delineating object boundaries, as Gaussian primitives often span across object edges due to their inherent volume and the lack of semantic guidance during training. In order to tackle these challenges, we introduce Clear Object Boundaries for 3DGS Segmentation (COB‑GS), which aims to improve segmentation accuracy by clearly delineating blurry boundaries of interwoven Gaussian primitives within the scene. Unlike existing approaches that remove ambiguous Gaussians and sacrifice visual quality, COB‑GS, as a 3DGS refinement method, jointly optimizes semantic and visual information, allowing the two different levels to cooperate with each other effectively. Specifically, for the semantic guidance, we introduce a boundary‑adaptive Gaussian splitting technique that leverages semantic gradient statistics to identify and split ambiguous Gaussians, aligning them closely with object boundaries. For the visual optimization, we rectify the degraded suboptimal texture of the 3DGS scene, particularly along the refined boundary structures. Experimental results show that COB‑GS substantially improves segmentation accuracy and robustness against inaccurate masks from pre‑trained models, yielding clear boundaries while preserving high visual quality. Code is available at https://github.com/ZestfulJX/COB‑GS.

Abstract:
Egocentric open‑surgery videos capture rich, fine‑grained details essential for accurately modeling surgical procedures and human behavior in the operating room. A detailed, pixel‑level understanding of hands and surgical tools is crucial for interpreting a surgeon's actions and intentions. We introduce EgoSurgery‑HTS, a new dataset with pixel‑wise annotations and a benchmark suite for segmenting surgical tools, hands, and interacting tools in egocentric open‑surgery videos. Specifically, we provide a labeled dataset for (1) tool instance segmentation of 14 distinct surgical tools, (2) hand instance segmentation, and (3) hand‑tool segmentation to label hands and the tools they manipulate. Using EgoSurgery‑HTS, we conduct extensive evaluations of state‑of‑the‑art segmentation methods and demonstrate significant improvements in the accuracy of hand and hand‑tool segmentation in egocentric open‑surgery videos compared to existing datasets. The dataset will be released at https://github.com/Fujiry0/EgoSurgery.

Abstract:
Multi‑modal semantic segmentation (MMSS) addresses the limitations of single‑modality data by integrating complementary information across modalities. Despite notable progress, a significant gap persists between research and real‑world deployment due to variability and uncertainty in multi‑modal data quality. Robustness has thus become essential for practical MMSS applications. However, the absence of standardized benchmarks for evaluating robustness hinders further advancement. To address this, we first survey existing MMSS literature and categorize representative methods to provide a structured overview. We then introduce a robustness benchmark that evaluates MMSS models under three scenarios: Entire‑Missing Modality (EMM), Random‑Missing Modality (RMM), and Noisy Modality (NM). From a probabilistic standpoint, we model modality failure under two conditions: (1) all damaged combinations are equally probable; (2) each modality fails independently following a Bernoulli distribution. Based on these, we propose four metrics (mIoU^Avg_EMM, mIoU^E_EMM, mIoU^Avg_RMM, and mIoU^E_RMM) to assess model robustness under EMM and RMM. This work provides the first dedicated benchmark for MMSS robustness, offering new insights and tools to advance the field. Source code is available at https://github.com/Chenfei‑Liao/Multi‑Modal‑Semantic‑Segmentation‑Robustness‑Benchmark.
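
The two failure models described above can be illustrated with a small sketch. It assumes a hypothetical table `miou_by_subset` that maps each non-empty set of surviving modalities to a measured mIoU. The function names, the example scores, and the choice to exclude the all-failed case and renormalize are assumptions of this sketch, not the benchmark's definitions, which may differ.

```python
from itertools import combinations


def nonempty_subsets(modalities):
    """Return every non-empty subset of the modality list as a frozenset."""
    return [frozenset(c)
            for r in range(1, len(modalities) + 1)
            for c in combinations(modalities, r)]


def average_miou(miou_by_subset, modalities):
    """Condition (1): every surviving combination is equally probable."""
    subsets = nonempty_subsets(modalities)
    return sum(miou_by_subset[s] for s in subsets) / len(subsets)


def expected_miou(miou_by_subset, modalities, p_fail):
    """Condition (2): each modality fails independently with probability p_fail.

    The all-failed case is excluded and the remaining weights are
    renormalized (an assumption of this sketch).
    """
    n = len(modalities)
    total, norm = 0.0, 0.0
    for s in nonempty_subsets(modalities):
        k = len(s)  # number of modalities that survived
        weight = (1 - p_fail) ** k * p_fail ** (n - k)
        total += weight * miou_by_subset[s]
        norm += weight
    return total / norm


# Hypothetical mIoU values for an RGB + depth model (illustrative only).
scores = {
    frozenset({"rgb"}): 40.0,
    frozenset({"depth"}): 20.0,
    frozenset({"rgb", "depth"}): 60.0,
}
mods = ["rgb", "depth"]
avg = average_miou(scores, mods)
exp = expected_miou(scores, mods, p_fail=0.2)
```

Under the Bernoulli model, a low failure probability weights the full-modality score most heavily, so the expectation sits above the uniform average whenever the full-modality score is the highest.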

Abstract:
High‑resolution semantic segmentation is essential for applications such as image editing, bokeh imaging, AR/VR, etc. Unfortunately, existing datasets often have limited resolution and lack precise mask details and boundaries. In this work, we build a large‑scale, matting‑level semantic segmentation dataset, named MaSS13K, which consists of 13,348 real‑world images, all at 4K resolution. MaSS13K provides high‑quality mask annotations of a number of objects, which are categorized into seven categories: human, vegetation, ground, sky, water, building, and others. MaSS13K features precise masks, with an average mask complexity 20‑50 times higher than existing semantic segmentation datasets. We consequently present a method specifically designed for high‑resolution semantic segmentation, namely MaSSFormer, which employs an efficient pixel decoder that aggregates high‑level semantic features and low‑level texture features across three stages, aiming to produce high‑resolution masks with minimal computational cost. Finally, we propose a new learning paradigm, which integrates the high‑quality masks of the seven given categories with pseudo labels from new classes, enabling MaSSFormer to transfer its accurate segmentation capability to other classes of objects. Our proposed MaSSFormer is comprehensively evaluated on the MaSS13K benchmark together with 14 representative segmentation models. We expect that our meticulously annotated MaSS13K dataset and the MaSSFormer model can facilitate the research of high‑resolution and high‑quality semantic segmentation. Datasets and codes can be found at https://github.com/xiechenxi99/MaSS13K.

Abstract:
Understanding the geometric and semantic properties of the scene is crucial in autonomous navigation and particularly challenging in the case of Unmanned Aerial Vehicle (UAV) navigation. Such information may be obtained by estimating depth and semantic segmentation maps of the surrounding environment; for practical use in autonomous navigation, the procedure must be performed as close to real‑time as possible. In this paper, we leverage monocular cameras on aerial robots to predict depth and semantic maps in low‑altitude unstructured environments. We propose a joint deep‑learning architecture that can perform the two tasks accurately and rapidly, and validate its effectiveness on the MidAir and Aeroscapes benchmark datasets. Our joint architecture proves to be competitive with or superior to other single and joint architecture methods while running at 20.2 FPS on a single NVIDIA Quadro P5000 GPU with a low memory footprint. All code for training and prediction can be found at https://github.com/Malga‑Vision/Co‑SemDepth

Abstract:
Consistency regularization has prevailed in semi‑supervised semantic segmentation and achieved promising performance. However, existing methods typically concentrate on enhancing the Image‑augmentation based Prediction consistency and optimizing the segmentation network as a whole, resulting in insufficient utilization of potential supervisory information. In this paper, we propose a Multi‑Constraint Consistency Learning (MCCL) approach to facilitate the staged enhancement of the encoder and decoder. Specifically, we first design a feature knowledge alignment (FKA) strategy to promote the feature consistency learning of the encoder from image‑augmentation. Our FKA encourages the encoder to derive consistent features for strongly and weakly augmented views from the perspectives of point‑to‑point alignment and prototype‑based intra‑class compactness. Moreover, we propose a self‑adaptive intervention (SAI) module to increase the discrepancy of aligned intermediate feature representations, promoting Feature‑perturbation based Prediction consistency learning. Self‑adaptive feature masking and noise injection are designed in an instance‑specific manner to perturb the features for robust learning of the decoder. Experimental results on Pascal VOC2012 and Cityscapes datasets demonstrate that our proposed MCCL achieves new state‑of‑the‑art performance. The source code and models are made available at https://github.com/NUST‑Machine‑Intelligence‑Laboratory/MCCL.

Abstract:
LiDAR point cloud semantic segmentation plays a crucial role in autonomous driving. In recent years, semi‑supervised methods have gained popularity due to their significant reduction in annotation labor and time costs. Current semi‑supervised methods typically focus on point cloud spatial distribution or consider short‑term temporal representations, e.g., only two adjacent frames, often overlooking the rich long‑term temporal properties inherent in autonomous driving scenarios. In driving experience, we observe that nearby objects, such as roads and vehicles, remain stable while driving, whereas distant objects exhibit greater variability in category and shape. This natural phenomenon is also captured by LiDAR, which reflects lower temporal sensitivity for nearby objects and higher sensitivity for distant ones. To leverage these characteristics, we propose HiLoTs, which learns high‑temporal sensitivity and low‑temporal sensitivity representations from continuous LiDAR frames. These representations are further enhanced and fused using a cross‑attention mechanism. Additionally, we employ a teacher‑student framework to align the representations learned by the labeled and unlabeled branches, effectively utilizing the large amounts of unlabeled data. Experimental results on the SemanticKITTI and nuScenes datasets demonstrate that our proposed HiLoTs outperforms state‑of‑the‑art semi‑supervised methods, and achieves performance close to LiDAR+Camera multimodal approaches. Code is available on https://github.com/rdlin118/HiLoTs

Abstract:
Semantic segmentation allows autonomous driving cars to understand the surroundings of the vehicle comprehensively. However, it is also crucial for the model to detect obstacles that may jeopardize the safety of autonomous driving systems. Based on our experiments, we find that current uni‑modal anomaly segmentation frameworks tend to produce high anomaly scores for non‑anomalous regions in images. Motivated by this empirical finding, we develop a multi‑modal uncertainty‑based anomaly segmentation framework, named MMRAS+, for autonomous driving systems. MMRAS+ effectively reduces the high anomaly outputs of non‑anomalous classes by introducing a text modality via the CLIP text encoder. Indeed, MMRAS+ is the first multi‑modal anomaly segmentation solution for autonomous driving. Moreover, we develop an ensemble module to further boost the anomaly segmentation performance. Experiments on the RoadAnomaly, SMIYC, and Fishyscapes validation datasets demonstrate the superior performance of our method. The code is available at https://github.com/HengGao12/MMRAS_plus.

Abstract:
Contrastive learning methods in self‑supervised settings have primarily focused on pre‑training encoders, while decoders are typically introduced and trained separately for downstream dense prediction tasks. However, this conventional approach overlooks the potential benefits of jointly pre‑training both encoder and decoder. In this paper, we propose DeCon, an efficient encoder‑decoder self‑supervised learning (SSL) framework that supports joint contrastive pre‑training. We first extend existing SSL architectures to accommodate diverse decoders and their corresponding contrastive losses. Then, we introduce a weighted encoder‑decoder contrastive loss with non‑competing objectives to enable the joint pre‑training of encoder‑decoder architectures. By adapting a contrastive SSL framework for dense prediction, DeCon establishes consistent state‑of‑the‑art performance on most of the evaluated tasks when pre‑trained on ImageNet‑1K, COCO and COCO+. Notably, when pre‑training a ResNet‑50 encoder on the COCO dataset, DeCon improves COCO object detection and instance segmentation compared to the baseline framework by +0.37 AP and +0.32 AP, respectively, and boosts semantic segmentation by +1.42 mIoU on Pascal VOC and by +0.50 mIoU on Cityscapes. These improvements generalize across recent backbones, decoders, datasets, and dense tasks beyond segmentation and object detection, and persist in out‑of‑domain scenarios, including limited‑data settings, demonstrating that joint pre‑training significantly enhances representation quality for dense prediction. Code is available at https://github.com/sebquetin/DeCon.git.

Abstract:
Compared with natural images, remote sensing images (RSIs) have a unique characteristic, i.e., larger intraclass variance, which makes semantic segmentation for remote sensing images more challenging. Moreover, existing semantic segmentation models for remote sensing images usually employ a vanilla softmax classifier, which has three drawbacks: (1) non‑direct supervision for the pixel representations during training; (2) inadequate modeling ability of parametric softmax classifiers under large intraclass variance; and (3) opaque process of classification decision. In this paper, we propose a novel classifier (called CenterSeg) customized for RSI semantic segmentation, which solves the abovementioned problems with multiple prototypes, direct supervision under the Grassmann manifold, and an interpretability strategy. Specifically, for each class, our CenterSeg obtains local class centers by aggregating corresponding pixel features based on ground‑truth masks, and generates multiple prototypes through hard attention assignment and momentum updating. In addition, we introduce the Grassmann manifold and constrain the joint embedding space of pixel features and prototypes based on two additional regularization terms. In particular, during inference, CenterSeg can further provide interpretability to the model by restricting the prototype as a sample of the training set. Experimental results on three remote sensing segmentation datasets validate the effectiveness of the model. Besides the superior performance, CenterSeg has the advantages of simplicity, lightweight design, compatibility, and interpretability. Code is available at https://github.com/xwmaxwma/rssegmentation.
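
The prototype mechanism described above (class centers pooled from ground-truth masks, hard assignment, momentum updating) can be sketched roughly as follows. This is a minimal illustration on plain Python lists, not the CenterSeg implementation: the function names, the dot-product similarity, and the momentum value are all assumptions, and the Grassmann-manifold regularization is omitted.

```python
def class_center(features, labels, cls):
    """Mean feature vector of the pixels whose ground-truth label is `cls`."""
    selected = [f for f, y in zip(features, labels) if y == cls]
    if not selected:
        return None  # class absent from this image
    dim = len(selected[0])
    return [sum(f[d] for f in selected) / len(selected) for d in range(dim)]


def nearest_prototype(center, prototypes):
    """Hard assignment: index of the prototype with the largest dot product."""
    scores = [sum(c * p for c, p in zip(center, proto)) for proto in prototypes]
    return max(range(len(prototypes)), key=scores.__getitem__)


def momentum_update(prototype, center, momentum=0.9):
    """Move the assigned prototype slightly towards the new class center."""
    return [momentum * p + (1 - momentum) * c for p, c in zip(prototype, center)]


# Three "pixels" with 2-D features; the first two belong to class 0.
features = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
labels = [0, 0, 1]
prototypes_cls0 = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical prototypes for class 0

center = class_center(features, labels, cls=0)
idx = nearest_prototype(center, prototypes_cls0)
prototypes_cls0[idx] = momentum_update(prototypes_cls0[idx], center, momentum=0.5)
```

Only the prototype selected by the hard assignment is updated, which lets different prototypes of one class specialize on different appearance modes.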

Abstract:
Semantic segmentation from RGB cameras is essential to the perception of autonomous flying vehicles. The stability of predictions through the captured videos is paramount to their reliability and, by extension, to the trustworthiness of the agents. In this paper, we propose a lightweight video semantic segmentation approach, suited to onboard real‑time inference, that achieves high temporal consistency on aerial data through Semantic Similarity Propagation (SSP) across frames. SSP temporally propagates the predictions of an efficient image segmentation model with global registration alignment to compensate for camera movements. It combines the current estimation and the prior prediction with linear interpolation using weights computed from the feature similarities of the two frames. Because data availability is a challenge in this domain, we propose a consistency‑aware Knowledge Distillation (KD) training procedure for sparsely labeled datasets with few annotations. Using a large image segmentation model as a teacher to train the efficient SSP, we leverage the strong correlations between labeled and unlabeled frames in the same training videos to obtain high‑quality supervision on all frames. KD‑SSP obtains a significant temporal consistency increase over the base image segmentation model of 12.5% and 6.7% TC on UAVid and RuralScapes respectively, with higher accuracy and comparable inference speed. On these aerial datasets, KD‑SSP provides a better trade‑off between segmentation quality and inference speed than other video methods proposed for general applications and shows considerably higher consistency. Project page: https://github.com/FraunhoferIVI/SSP.
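
The interpolation step described above can be sketched for a single pixel as follows. This is an illustrative sketch only: it assumes the prior frame has already been registered to the current one, uses cosine similarity as the feature similarity, and introduces a hypothetical `max_prior_weight` cap. None of these choices is confirmed by the abstract.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def propagate(current_probs, prior_probs, current_feat, prior_feat,
              max_prior_weight=0.5):
    """Blend class probabilities of one pixel across two aligned frames.

    The more similar the features, the more the prior prediction is trusted.
    `max_prior_weight` is a hypothetical cap so the current frame always counts.
    """
    similarity = max(0.0, cosine_similarity(current_feat, prior_feat))
    w = max_prior_weight * similarity
    return [(1 - w) * c + w * p for c, p in zip(current_probs, prior_probs)]


# Same features in both frames: the prior prediction gets the maximum weight.
stable = propagate([0.6, 0.4], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0])
# Orthogonal features (scene content changed): the prior is ignored.
changed = propagate([0.6, 0.4], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
```

Because the weights stay in [0, 1] and sum to one, the blended output remains a valid probability vector whenever both inputs are.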

Abstract:
Traditional transformer‑based semantic segmentation relies on quantized embeddings. However, our analysis reveals that autoencoder accuracy on segmentation masks using quantized embeddings (e.g. VQ‑VAE) is 8% lower than with continuous‑valued embeddings (e.g. KL‑VAE). Motivated by this, we propose a continuous‑valued embedding framework for semantic segmentation. By reformulating semantic mask generation as a continuous image‑to‑embedding diffusion process, our approach eliminates the need for discrete latent representations while preserving fine‑grained spatial and semantic details. Our key contribution includes a diffusion‑guided autoregressive transformer that learns a continuous semantic embedding space by modeling long‑range dependencies in image features. Our framework contains a unified architecture combining a VAE encoder for continuous feature extraction, a diffusion‑guided transformer for conditioned embedding generation, and a VAE decoder for semantic mask reconstruction. Our setting facilitates zero‑shot domain adaptation capabilities enabled by the continuity of the embedding space. Experiments across diverse datasets (e.g., Cityscapes and domain‑shifted variants) demonstrate state‑of‑the‑art robustness to distribution shifts, including adverse weather (e.g., fog, snow) and viewpoint variations. Our model also exhibits strong noise resilience, achieving robust performance (approximately 95% AP compared to baseline) under Gaussian noise, moderate motion blur, and moderate brightness/contrast variations, while experiencing only a moderate impact (approximately 90% AP compared to baseline) from 50% salt and pepper noise, saturation and hue shifts. Code available: https://github.com/mahmed10/CAMSS.git

Abstract:
Semantic segmentation in urban scene analysis has mainly focused on images or point clouds, while textured meshes, which offer richer spatial representation, remain underexplored. This paper introduces SUM Parts, the first large‑scale dataset for urban textured meshes with part‑level semantic labels, covering about 2.5 km² with 21 classes. The dataset was created using our own annotation tool, which supports both face‑ and texture‑based annotations with efficient interactive selection. We also provide a comprehensive evaluation of 3D semantic segmentation and interactive annotation methods on this dataset. Our project page is available at https://tudelft3d.github.io/SUMParts/.

Abstract:
Video object segmentation is crucial for the efficient analysis of complex medical video data, yet it faces significant challenges in data availability and annotation. We introduce the task of one‑shot medical video object segmentation, which requires separating foreground and background pixels throughout a video given only the mask annotation of the first frame. To address this problem, we propose a temporal contrastive memory network comprising image and mask encoders to learn feature representations, a temporal contrastive memory bank that aligns embeddings from adjacent frames while pushing apart distant ones to explicitly model inter‑frame relationships and stores these features, and a decoder that fuses encoded image features and memory readouts for segmentation. We also collect a diverse, multi‑source medical video dataset spanning various modalities and anatomies to benchmark this task. Extensive experiments demonstrate state‑of‑the‑art performance in segmenting both seen and unseen structures from a single exemplar, showing ability to generalize from scarce labels. This highlights the potential to alleviate annotation burdens for medical video analysis. Code is available at https://github.com/MedAITech/TCMN.
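
The idea of aligning embeddings from adjacent frames while pushing apart distant ones can be illustrated with a generic InfoNCE-style loss. This is a stand-in chosen for illustration, not the loss used by the authors: the function names, the temperature value, and the assumption of unit-length embeddings are all hypothetical, and the memory bank itself is not modeled.

```python
import math


def dot(a, b):
    """Dot product of two embedding vectors."""
    return sum(x * y for x, y in zip(a, b))


def temporal_contrastive_loss(anchor, adjacent, distant, temperature=0.1):
    """InfoNCE-style loss for one anchor frame embedding.

    `adjacent` is the embedding of a neighboring frame (positive) and
    `distant` is a list of embeddings from far-away frames (negatives).
    Embeddings are assumed to be unit length.
    """
    pos = math.exp(dot(anchor, adjacent) / temperature)
    neg = sum(math.exp(dot(anchor, d) / temperature) for d in distant)
    return -math.log(pos / (pos + neg))


anchor = [1.0, 0.0]
# Neighboring frame looks like the anchor, distant frame does not: low loss.
good = temporal_contrastive_loss(anchor, [1.0, 0.0], [[0.0, 1.0]])
# The reverse situation is penalized heavily.
bad = temporal_contrastive_loss(anchor, [0.0, 1.0], [[1.0, 0.0]])
```

Minimizing such a loss pulls neighboring-frame embeddings together and pushes distant-frame embeddings away, which is the inter-frame relationship the abstract describes.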

Abstract:
Few‑shot video object segmentation aims to reduce annotation costs; however, existing methods still require abundant dense frame annotations for training, which are scarce in the medical domain. We investigate an extremely low‑data regime that utilizes annotations from only a few video frames and leverages existing labeled images to minimize costly video annotations. Specifically, we propose a two‑phase framework. First, we learn a few‑shot segmentation model using labeled images. Subsequently, to improve performance without full supervision, we introduce a spatiotemporal consistency relearning approach on medical videos that enforces consistency between consecutive frames. Constraints are also enforced between the image model and relearning model at both feature and prediction levels. Experiments demonstrate the superiority of our approach over state‑of‑the‑art few‑shot segmentation methods. Our model bridges the gap between abundant annotated medical images and scarce, sparsely labeled medical videos to achieve strong video segmentation performance in this low data regime. Code is available at https://github.com/MedAITech/RAB.

Abstract:
Lifting multi‑view 2D instance segmentation to a radiance field has proven to be effective to enhance 3D understanding. Existing methods rely on direct matching for end‑to‑end lifting, yielding inferior results; or employ a two‑stage solution constrained by complex pre‑ or post‑processing. In this work, we design a new end‑to‑end object‑aware lifting approach, named Unified‑Lift, that provides accurate 3D segmentation based on the 3D Gaussian representation. To start, we augment each Gaussian point with an additional Gaussian‑level feature learned using a contrastive loss to encode instance information. Importantly, we introduce a learnable object‑level codebook to account for individual objects in the scene for an explicit object‑level understanding and associate the encoded object‑level features with the Gaussian‑level point features for segmentation predictions. While promising, achieving effective codebook learning is non‑trivial and a naive solution leads to degraded performance. Therefore, we formulate the association learning module and the noisy label filtering module for effective and robust codebook learning. We conduct experiments on three benchmarks: LERF‑Masked, Replica, and Messy Rooms datasets. Both qualitative and quantitative results manifest that our Unified‑Lift clearly outperforms existing methods in terms of segmentation quality and time efficiency. The code is publicly available at https://github.com/Runsong123/Unified‑Lift.

Abstract:
Self‑supervised learning (SSL) on 3D point clouds has the potential to learn feature representations that can transfer to diverse sensors and multiple downstream perception tasks. However, recent SSL approaches fail to define pretext tasks that retain geometric information such as object pose and scale, which can be detrimental to the performance of downstream localization and geometry‑sensitive 3D scene understanding tasks, such as 3D semantic segmentation and 3D object detection. We propose PSA‑SSL, a novel extension to point cloud SSL that learns object pose and size‑aware (PSA) features. Our approach defines a self‑supervised bounding box regression pretext task, which retains object pose and size information. Furthermore, we incorporate LiDAR beam pattern augmentation on input point clouds, which encourages learning sensor‑agnostic features. Our experiments demonstrate that with a single pretrained model, our light‑weight yet effective extensions achieve significant improvements on 3D semantic segmentation with limited labels across popular autonomous driving datasets (Waymo, nuScenes, SemanticKITTI). Moreover, our approach outperforms other state‑of‑the‑art SSL methods on 3D semantic segmentation (using up to 10 times fewer labels), as well as on 3D object detection. Our code will be released at https://github.com/TRAILab/PSA‑SSL.

Abstract:
Panoptic segmentation of LiDAR point clouds is fundamental to outdoor scene understanding, with autonomous driving being a primary application. While state‑of‑the‑art approaches typically rely on end‑to‑end deep learning architectures and extensive manual annotations of instances, the significant cost and time investment required for labeling large‑scale point cloud datasets remains a major bottleneck in this field. In this work, we demonstrate that competitive panoptic segmentation can be achieved using only semantic labels, with instances predicted without any training or annotations. Our method, Alpine, outperforms most state‑of‑the‑art supervised methods on standard benchmarks including SemanticKITTI and nuScenes, and outperforms every publicly available method on SemanticKITTI as a drop‑in instance head replacement, while running in real‑time on a single‑threaded CPU and requiring no instance labels. It is fully explainable, and requires no learning or parameter tuning. Alpine combined with state‑of‑the‑art semantic segmentation ranks first on the official panoptic segmentation leaderboard of SemanticKITTI. Code is available at https://github.com/valeoai/Alpine/

Abstract:
Crop yield estimation is a relevant problem in agriculture, because an accurate yield estimate can support farmers' decisions on harvesting or precision intervention. Robots can help to automate this process. To do so, they need to be able to perceive the surrounding environment to identify target objects such as trees and plants. In this paper, we introduce a novel approach to address the problem of hierarchical panoptic segmentation of apple orchards on 3D data from different sensors. Our approach is able to simultaneously provide semantic segmentation, instance segmentation of trunks and fruits, and instance segmentation of trees (a trunk with its fruits). This allows us to identify relevant information such as individual plants, fruits, and trunks, and to capture the relationships among them, such as a precise estimate of the number of fruits associated with each tree in an orchard. To efficiently evaluate our approach for hierarchical panoptic segmentation, we provide a dataset designed specifically for this task. Our dataset is recorded in Bonn, Germany, in a real apple orchard with a variety of sensors, ranging from a terrestrial laser scanner to an RGB‑D camera mounted on different robot platforms. The experiments show that our approach surpasses state‑of‑the‑art approaches in 3D panoptic segmentation in the agricultural domain, while also providing full hierarchical panoptic segmentation. Our dataset is publicly available at https://www.ipb.uni‑bonn.de/data/hops/. The open‑source implementation of our approach is available at https://github.com/PRBonn/hapt3D.

Abstract:
Optical remote sensing image dehazing presents significant challenges due to its extensive spatial scale and highly non‑uniform haze distribution, which traditional single‑image dehazing methods struggle to address effectively. While Synthetic Aperture Radar (SAR) imagery offers inherently haze‑free reference information for large‑scale scenes, existing SAR‑guided dehazing approaches face two critical limitations: the integration of SAR information often diminishes the quality of haze‑free regions, and the instability of feature quality further exacerbates cross‑modal domain shift. To overcome these challenges, we introduce DehazeMamba, a novel SAR‑guided dehazing network built on a progressive haze decoupling fusion strategy. Our approach incorporates two key innovations: a Haze Perception and Decoupling Module (HPDM) that dynamically identifies haze‑affected regions through optical‑SAR difference analysis, and a Progressive Fusion Module (PFM) that mitigates domain shift through a two‑stage fusion process based on feature quality assessment. To facilitate research in this domain, we present MRSHaze, a large‑scale benchmark dataset comprising 8,000 pairs of temporally synchronized, precisely geo‑registered SAR‑optical images with high resolution and diverse haze conditions. Extensive experiments demonstrate that DehazeMamba significantly outperforms state‑of‑the‑art methods, achieving a 0.73 dB improvement in PSNR and substantial enhancements in downstream tasks such as semantic segmentation. The dataset is available at https://github.com/mmic‑lcl/Datasets‑and‑benchmark‑code.

Abstract:
Manipulating transparent objects presents significant challenges due to the complexities introduced by their reflection and refraction properties, which considerably hinder the accurate estimation of their 3D shapes. To address these challenges, we propose a single‑view RGB‑D‑based depth completion framework, TransDiff, that leverages Denoising Diffusion Probabilistic Models (DDPM) to achieve material‑agnostic object grasping in desktop settings. Specifically, we leverage features extracted from RGB images, including semantic segmentation, edge maps, and normal maps, to condition the depth map generation process. Our method learns an iterative denoising process that transforms a random depth distribution into a depth map, guided by initially refined depth information, ensuring more accurate depth estimation in scenarios involving transparent objects. Additionally, we propose a novel training method to better align the noisy depth and RGB image features, which are used as conditions to refine depth estimation step by step. Finally, we utilize an improved inference process to accelerate the denoising procedure. Through comprehensive experimental validation, we demonstrate that our method significantly outperforms the baselines in both synthetic and real‑world benchmarks with acceptable inference time. The demo of our method can be found at https://wang‑haoxiao.github.io/TransDiff/

Abstract:
3D semantic segmentation plays a fundamental and crucial role in understanding 3D scenes. While contemporary state‑of‑the‑art techniques predominantly concentrate on elevating the overall performance of 3D semantic segmentation based on general metrics (e.g. mIoU, mAcc, and oAcc), they unfortunately leave the exploration of challenging regions for segmentation mostly neglected. In this paper, we revisit 3D semantic segmentation through a more granular lens, shedding light on subtle complexities that are typically overshadowed by broader performance metrics. Concretely, we have delineated 3D semantic segmentation errors into four comprehensive categories as well as corresponding evaluation metrics tailored to each. Building upon this categorical framework, we introduce an innovative 3D semantic segmentation network called BFANet that incorporates detailed analysis of semantic boundary features. First, we design the boundary‑semantic module to decouple point cloud features into semantic and boundary features, and fuse their query queue to enhance semantic features with attention. Second, we introduce a more concise and accelerated boundary pseudo‑label calculation algorithm, which is 3.9 times faster than the state‑of‑the‑art, offering compatibility with data augmentation and enabling efficient computation in training. Extensive experiments on benchmark data indicate the superiority of our BFANet model, confirming the significance of emphasizing the four uniquely designed metrics. Code is available at https://github.com/weiguangzhao/BFANet.

Abstract:
3D Gaussian Splatting‑based indoor open‑world free‑view synthesis approaches have shown significant performance with dense input images. However, they exhibit poor performance when confronted with sparse inputs, primarily due to the sparse distribution of Gaussian points and insufficient view supervision. To relieve these challenges, we propose SPC‑GS, leveraging Scene‑layout‑based Gaussian Initialization (SGI) and Semantic‑Prompt Consistency (SPC) Regularization for open‑world free view synthesis with sparse inputs. Specifically, SGI provides a dense, scene‑layout‑based Gaussian distribution by utilizing view‑changed images generated from the video generation model and view‑constraint Gaussian points densification. Additionally, SPC mitigates limited view supervision by employing semantic‑prompt‑based consistency constraints developed by SAM2. This approach leverages available semantics from training views, serving as instructive prompts, to optimize visually overlapping regions in novel views with 2D and 3D consistency constraints. Extensive experiments demonstrate the superior performance of SPC‑GS across Replica and ScanNet benchmarks. Notably, our SPC‑GS achieves a 3.06 dB gain in PSNR for reconstruction quality and a 7.3% improvement in mIoU for open‑world semantic segmentation.

Abstract:
Cell instance segmentation (CIS) is crucial for identifying individual cell morphologies in histopathological images, providing valuable insights for biological and medical research. While unsupervised CIS (UCIS) models aim to reduce the heavy reliance on labor‑intensive image annotations, they fail to accurately capture cell boundaries, causing missed detections and poor performance. Recognizing the absence of error‑free instances as a key limitation, we present COIN (COnfidence score‑guided INstance distillation), a novel annotation‑free framework with three key steps: (1) Increasing the sensitivity for the presence of error‑free instances via unsupervised semantic segmentation with optimal transport, leveraging its ability to discriminate spatially minor instances, (2) Instance‑level confidence scoring to measure the consistency between model prediction and refined mask and identify highly confident instances, offering an alternative to ground truth annotations, and (3) Progressive expansion of confidence with recursive self‑distillation. Extensive experiments across six datasets show COIN outperforming existing UCIS methods, even surpassing semi‑ and weakly‑supervised approaches across all metrics on the MoNuSeg and TNBC datasets. The code is available at https://github.com/shjo‑april/COIN.

Abstract:
In recent years, numerous neural network architectures specifically designed for the instance segmentation of nuclei in microscopic images have been released. These models embed nuclei‑specific priors to outperform generic architectures like U‑Nets; however, they require large annotated datasets, which are often not available. Generative models (GANs, diffusion models) have been used to compensate for this by synthesizing training data. These two‑stage approaches are computationally expensive, as first a generative model and then a segmentation model has to be trained. We propose CyclePose, a hybrid framework integrating synthetic data generation and segmentation training. CyclePose builds on a CycleGAN architecture, which allows unpaired translation between microscopy images and segmentation masks. We embed a segmentation model into CycleGAN and leverage a cycle consistency loss for self‑supervision. Without annotated data, CyclePose outperforms other weakly or unsupervised methods on two public datasets. Code is available at https://github.com/jonasutz/CyclePose

Abstract:
Decoding visual stimuli from neural activity is essential for understanding the human brain. While fMRI methods have successfully reconstructed static images, fMRI‑to‑video reconstruction faces challenges due to the need for capturing spatiotemporal dynamics like motion and scene transitions. Recent approaches have improved semantic and perceptual alignment but struggle to integrate coarse fMRI data with detailed visual features. Inspired by the hierarchical organization of the visual system, we propose NEURONS, a novel framework that decouples learning into four correlated sub‑tasks: key object segmentation, concept recognition, scene description, and blurry video reconstruction. This approach simulates the visual cortex's functional specialization, allowing the model to capture diverse video content. In the inference stage, NEURONS generates robust conditioning signals for a pre‑trained text‑to‑video diffusion model to reconstruct the videos. Extensive experiments demonstrate that NEURONS outperforms state‑of‑the‑art baselines, achieving solid improvements in video consistency (26.6%) and semantic‑level accuracy (19.1%). Notably, NEURONS shows a strong functional correlation with the visual cortex, highlighting its potential for brain‑computer interfaces and clinical applications. Code and model weights are available at: https://github.com/xmed‑lab/NEURONS.

Abstract:
With the continuous advancement of human exploration into deep space, intelligent perception and high‑precision segmentation technology for on‑orbit multi‑spacecraft targets have become critical factors for ensuring the success of modern space missions. However, the complex deep space environment, diverse imaging conditions, and high variability in spacecraft morphology pose significant challenges to traditional segmentation methods. This paper proposes SpaceSeg, an innovative vision foundation model‑based segmentation framework with four core technical innovations: First, the Multi‑Scale Hierarchical Attention Refinement Decoder (MSHARD) achieves high‑precision feature decoding through cross‑resolution feature fusion via hierarchical attention. Second, the Multi‑spacecraft Connected Component Analysis (MS‑CCA) effectively resolves topological structure confusion in dense targets. Third, the Spatial Domain Adaptation Transform framework (SDAT) eliminates cross‑domain disparities and resists spatial sensor perturbations through composite enhancement strategies. Finally, a custom Multi‑Spacecraft Segmentation Task Loss Function is created to significantly improve segmentation robustness in deep space scenarios. To support algorithm validation, we construct the first multi‑scale on‑orbit multi‑spacecraft semantic segmentation dataset SpaceES, which covers four types of spatial backgrounds and 17 typical spacecraft targets. In testing, SpaceSeg achieves state‑of‑the‑art performance with 89.87% mIoU and 99.98% mAcc, surpassing the best existing methods by 5.71 percentage points. The dataset and code are open‑sourced at https://github.com/Akibaru/SpaceSeg to provide critical technical support for next‑generation space situational awareness systems.

Abstract:
Semantic segmentation is a key technique that enables mobile robots to understand and navigate surrounding environments autonomously. However, most existing works focus on segmenting known objects, overlooking the identification of unknown classes, which are common in real‑world applications. In this paper, we propose a feature‑oriented framework for open‑set semantic segmentation on LiDAR data, capable of identifying unknown objects while retaining the ability to classify known ones. We design a decomposed dual‑decoder network to simultaneously perform closed‑set semantic segmentation and generate distinctive features for unknown objects. The network is trained with multi‑objective loss functions to capture the characteristics of known and unknown objects. Using the extracted features, we introduce an anomaly detection mechanism to identify unknown objects. By integrating the results of closed‑set semantic segmentation and anomaly detection, we achieve effective feature‑driven LiDAR open‑set semantic segmentation. Evaluations on both SemanticKITTI and nuScenes datasets demonstrate that our proposed framework significantly outperforms state‑of‑the‑art methods. The source code will be made publicly available at https://github.com/nubot‑nudt/DOSS.

Abstract:
Learning skills in open‑world environments is essential for developing agents capable of handling a variety of tasks by combining basic skills. Online demonstration videos are typically long but unsegmented, making them difficult to segment and label with skill identifiers. Unlike existing methods that rely on sequence sampling or human labeling, we have developed a self‑supervised learning‑based approach to segment these long videos into a series of semantic‑aware and skill‑consistent segments. Drawing inspiration from human cognitive event segmentation theory, we introduce Skill Boundary Detection (SBD), an annotation‑free temporal video segmentation algorithm. SBD detects skill boundaries in a video by leveraging prediction errors from a pretrained unconditional action‑prediction model. This approach is based on the assumption that a significant increase in prediction error indicates a shift in the skill being executed. We evaluated our method in Minecraft, a rich open‑world simulator with extensive gameplay videos available online. Our SBD‑generated segments improved the average performance of conditioned policies by 63.7% and 52.1% on short‑term atomic skill tasks, and their corresponding hierarchical agents by 11.3% and 20.8% on long‑horizon tasks. Our method can leverage the diverse YouTube videos to train instruction‑following agents. The project page can be found in https://craftjarvis.github.io/SkillDiscovery.
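
The boundary assumption stated above (a sharp rise in prediction error signals a change of skill) can be sketched as follows. This is a simplified illustration, not the SBD implementation: the error values are made up, and the rule "error exceeds the running mean of the current segment by `jump`" as well as the names `detect_boundaries`, `jump`, and `min_len` are assumptions introduced for this example.

```python
def detect_boundaries(errors, jump=1.0, min_len=2):
    """Return the time steps at which a new segment is assumed to start.

    errors:  per-step prediction errors of an action-prediction model
             (hypothetical values in the example below).
    jump:    how far an error must exceed the running mean of the
             current segment to count as a boundary (assumed rule).
    min_len: minimum number of steps before a segment may be closed.
    """
    boundaries = []
    seg_sum, seg_len = 0.0, 0
    for t, err in enumerate(errors):
        if seg_len >= min_len and err - seg_sum / seg_len > jump:
            boundaries.append(t)
            seg_sum, seg_len = 0.0, 0  # start a fresh baseline
            continue                   # keep the spike out of the new baseline
        seg_sum += err
        seg_len += 1
    return boundaries


def split_segments(num_steps, boundaries):
    """Turn boundary indices into half-open (start, end) segments."""
    starts = [0] + boundaries
    ends = boundaries + [num_steps]
    return list(zip(starts, ends))


errors = [0.2, 0.3, 0.2, 2.1, 0.4, 0.3, 0.2, 1.9, 0.3]
boundaries = detect_boundaries(errors)
segments = split_segments(len(errors), boundaries)
print(boundaries)  # [3, 7]
print(segments)    # [(0, 3), (3, 7), (7, 9)]
```

Each resulting segment could then be treated as one skill-consistent clip when training a conditioned policy.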

Abstract:
Recent advances in self‑supervised learning for Vision Transformers (ViTs) have fueled breakthroughs in remote sensing (RS) foundation models. However, the quadratic complexity of self‑attention poses a significant barrier to scalability, particularly for large models and high‑resolution images. While the linear‑complexity Mamba architecture offers a promising alternative, existing RS applications of Mamba remain limited to supervised tasks on small, domain‑specific datasets. To address these challenges, we propose RoMA, a framework that enables scalable self‑supervised pretraining of Mamba‑based RS foundation models using large‑scale, diverse, unlabeled data. RoMA enhances scalability for high‑resolution images through a tailored auto‑regressive learning strategy, incorporating two key innovations: 1) a rotation‑aware pretraining mechanism combining adaptive cropping with angular embeddings to handle sparsely distributed objects with arbitrary orientations, and 2) multi‑scale token prediction objectives that address the extreme variations in object scales inherent to RS imagery. Systematic empirical studies validate that Mamba adheres to RS data and parameter scaling laws, with performance scaling reliably as model and data size increase. Furthermore, experiments across scene classification, object detection, and semantic segmentation tasks demonstrate that RoMA‑pretrained Mamba models consistently outperform ViT‑based counterparts in both accuracy and computational efficiency. The source code and pretrained models will be released at https://github.com/MiliLab/RoMA.

Abstract:
Open Semantic Mapping (OSM) is a key technology in robotic perception, combining semantic segmentation and SLAM techniques. This paper introduces a dynamically configurable and highly automated LLM/LVLM‑powered pipeline for evaluating OSM solutions called OSMa‑Bench (Open Semantic Mapping Benchmark). The study focuses on evaluating state‑of‑the‑art semantic mapping algorithms under varying indoor lighting conditions, a critical challenge in indoor environments. We introduce a novel dataset with simulated RGB‑D sequences and ground truth 3D reconstructions, facilitating the rigorous analysis of mapping performance across different lighting conditions. Through experiments on leading models such as ConceptGraphs, BBQ, and OpenScene, we evaluate the semantic fidelity of object recognition and segmentation. Additionally, we introduce a Scene Graph evaluation method to analyze the ability of models to interpret semantic structure. The results provide insights into the robustness of these models, forming future research directions for developing resilient and adaptable robotic systems. Project page is available at https://be2rlab.github.io/OSMa‑Bench/.

Abstract:
High‑quality instance and panoptic segmentation has traditionally relied on dense instance‑level annotations such as masks, boxes, or points, which are costly, inconsistent, and difficult to scale. Unsupervised and weakly‑supervised approaches reduce this burden but remain constrained by semantic backbone constraints and human bias, often producing merged or fragmented outputs. We present TRACE (TRAnsforming diffusion Cues to instance Edges), showing that text‑to‑image diffusion models secretly function as instance edge annotators. TRACE identifies the Instance Emergence Point (IEP) where object boundaries first appear in self‑attention maps, extracts boundaries through Attention Boundary Divergence (ABDiv), and distills them into a lightweight one‑step edge decoder. This design removes the need for per‑image diffusion inversion, achieving 81x faster inference while producing sharper and more connected boundaries. On the COCO benchmark, TRACE improves unsupervised instance segmentation by +5.1 AP, and in tag‑supervised panoptic segmentation it outperforms point‑supervised baselines by +1.7 PQ without using any instance‑level labels. These results reveal that diffusion models encode hidden instance boundary priors, and that decoding these signals offers a practical and scalable alternative to costly manual annotation. Project Page: https://shjo‑april.github.io/TRACE/

Abstract:
Self‑supervised learning (SSL) has revolutionized representation learning in Remote Sensing (RS), advancing Geospatial Foundation Models (GFMs) to leverage vast unlabeled satellite imagery for diverse downstream tasks. Currently, GFMs primarily employ objectives like contrastive learning or masked image modeling, owing to their proven success in learning transferable representations. However, generative diffusion models, which demonstrate the potential to capture multi‑grained semantics essential for RS tasks during image generation, remain underexplored for discriminative applications. This prompts the question: can generative diffusion models also excel and serve as GFMs with sufficient discriminative power? In this work, we answer this question with SatDiFuser, a framework that transforms a diffusion‑based generative geospatial foundation model into a powerful pretraining tool for discriminative RS. By systematically analyzing multi‑stage, noise‑dependent diffusion features, we develop three fusion strategies to effectively leverage these diverse representations. Extensive experiments on remote sensing benchmarks show that SatDiFuser outperforms state‑of‑the‑art GFMs, achieving gains of up to +5.7% mIoU in semantic segmentation and +7.9% F1‑score in classification, demonstrating the capacity of diffusion‑based generative foundation models to rival or exceed discriminative GFMs. The source code is available at: https://github.com/yurujaja/SatDiFuser.

Abstract:
The exploration of Bird's‑Eye View (BEV) mapping technology has driven significant innovation in visual perception technology for autonomous driving. BEV mapping models need to be applied to the unlabeled real world, making the study of unsupervised domain adaptation models an essential path. However, research on unsupervised domain adaptation for BEV mapping remains limited and cannot perfectly accommodate all BEV mapping tasks. To address this gap, this paper proposes HierDAMap, a universal and holistic BEV domain adaptation framework with hierarchical perspective priors. Unlike existing research that solely focuses on image‑level learning using prior knowledge, this paper explores the guiding role of perspective prior knowledge across three distinct levels: global, sparse, and instance levels. With these priors, HierDA consists of three essential components, including Semantic‑Guided Pseudo Supervision (SGPS), Dynamic‑Aware Coherence Learning (DACL), and Cross‑Domain Frustum Mixing (CDFM). SGPS constrains the cross‑domain consistency of perspective feature distribution through pseudo labels generated by vision foundation models in 2D space. To mitigate feature distribution discrepancies caused by spatial variations, DACL employs uncertainty‑aware predicted depth as an intermediary to derive dynamic BEV labels from perspective pseudo‑labels, thereby constraining the coarse BEV features derived from corresponding perspective features. CDFM, on the other hand, leverages perspective masks of view frustum to mix multi‑view perspective images from both domains, which guides cross‑domain view transformation and encoding learning through mixed BEV labels. The proposed method is verified on multiple BEV mapping tasks, such as BEV semantic segmentation, high‑definition semantic, and vectorized mapping. The source code will be made publicly available at https://github.com/lynn‑yu/HierDAMap.

Abstract:
Segment Anything Model (SAM) exhibits remarkable zero‑shot segmentation capability; however, its prohibitive computational costs make edge deployment challenging. Although post‑training quantization (PTQ) offers a promising compression solution, existing methods yield unsatisfactory results when applied to SAM, owing to its specialized model components and promptable workflow: (i) The mask decoder's attention exhibits extreme activation outliers, and we find that aggressive clipping (even 100x), without smoothing or isolation, is effective in suppressing outliers while maintaining performance. Unfortunately, traditional distribution‑based metrics (e.g., MSE) fail to provide such large‑scale clipping. (ii) Existing quantization reconstruction methods neglect the semantic interactivity of SAM, leading to misalignment between image features and prompt intention. To address the above issues, we propose SAQ‑SAM in this paper, which boosts PTQ for SAM from the perspective of semantic alignment. Specifically, we propose Perceptual‑Consistency Clipping, which exploits attention focus overlap to promote aggressive clipping while preserving semantic capabilities. Furthermore, we propose Prompt‑Aware Reconstruction, which incorporates image‑prompt interactions by leveraging cross‑attention in the mask decoder, thus facilitating alignment in both distribution and semantics. Moreover, to ensure interaction efficiency, we design a layer‑skipping strategy for image tokens in the encoder. Extensive experiments are conducted on various SAM sizes and tasks, including instance segmentation, oriented object detection, and semantic segmentation, and the results show that our method consistently exhibits advantages. For example, when quantizing SAM‑B to 4‑bit, SAQ‑SAM achieves 11.7% higher mAP than the baseline in the instance segmentation task.
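
To illustrate why aggressive clipping matters under extreme outliers, the sketch below applies plain symmetric uniform quantization with a fixed clipping threshold to a toy activation vector. It is not Perceptual‑Consistency Clipping or any part of SAQ‑SAM; the function `quantize`, the threshold values, and the activations are assumptions chosen for this example.

```python
def quantize(values, clip, bits=4):
    """Symmetric uniform quantization with clipping to [-clip, clip]."""
    levels = 2 ** (bits - 1) - 1  # 7 positive levels for 4-bit
    scale = clip / levels         # step size between quantization levels
    out = []
    for v in values:
        v = max(-clip, min(clip, v))          # clip outliers
        out.append(round(v / scale) * scale)  # snap to the nearest level
    return out


# Five "ordinary" activations plus one extreme outlier (toy data).
acts = [0.1, -0.3, 0.6, 0.2, -0.4, 100.0]

# Range set by the outlier: the step size is about 14.3.
no_clip = quantize(acts, clip=max(abs(v) for v in acts))

# Aggressive clipping (100x smaller range): the step size is about 0.14.
clipped = quantize(acts, clip=1.0)

print(no_clip[:-1])  # every ordinary activation collapses to zero
print(clipped[:-1])  # ordinary activations keep distinct nonzero values
```

With the range dictated by the outlier, all ordinary activations fall into the zero bin, whereas the clipped range preserves them at the cost of saturating the outlier.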

Abstract:
In tomato greenhouses, phenotypic measurement helps researchers and farmers monitor crop growth and thereby control environmental conditions precisely and in time, leading to better quality and higher yield. Traditional phenotyping mainly relies on manual measurement, which is accurate but inefficient and, more importantly, endangers the health and safety of people. Several studies have explored computer vision‑based methods to replace manual phenotyping. However, 2D‑based methods need extra calibration, damage the fruit, or can only measure limited and meaningless traits. 3D‑based methods need an extra depth camera, which is expensive and unacceptable for most farmers. In this paper, we propose a non‑contact tomato fruit phenotyping method, titled TomatoScanner, for which an RGB image is the only required input. First, pixel features are extracted by instance segmentation with our proposed EdgeYOLO, with preprocessing of individual separation and pose correction. Second, depth features are extracted by depth estimation with Depth Pro. Third, pixel and depth features are fused to output real‑world phenotype results. We establish a self‑built Tomato Phenotype Dataset to test TomatoScanner, which achieves excellent phenotyping on width, height, vertical area and volume, with median relative errors of 5.63%, 7.03%, ‑0.64% and 37.06%, respectively. We propose and add three innovative modules, EdgeAttention, EdgeLoss and EdgeBoost, into EdgeYOLO to enhance the segmentation accuracy on the edge portion. Precision and mean Edge Error improve greatly, from 0.943 and 5.641% to 0.986 and 2.963%, respectively. Meanwhile, EdgeYOLO remains lightweight and efficient, with a 48.7 M weights size and 76.34 FPS. Codes and datasets: https://github.com/AlexTraveling/TomatoScanner.

Abstract:
Few‑shot semantic segmentation (FSS) aims to enable models to segment novel/unseen object classes using only a limited number of labeled examples. However, current FSS methods frequently struggle with generalization due to incomplete and biased feature representations, especially when support images do not capture the full appearance variability of the target class. To improve the FSS pipeline, we propose a novel framework that utilizes large language models (LLMs) to adapt general class semantic information to the query image. Furthermore, the framework employs dense pixel‑wise matching to identify similarities between query and support images, resulting in enhanced FSS performance. Inspired by reasoning‑based segmentation frameworks, our method, named DSV‑LFS, introduces an additional token into the LLM vocabulary, allowing a multimodal LLM to generate a "semantic prompt" from class descriptions. In parallel, a dense matching module identifies visual similarities between the query and support images, generating a "visual prompt". These prompts are then jointly employed to guide the prompt‑based decoder for accurate segmentation of the query image. Comprehensive experiments on the benchmark datasets Pascal‑5^i and COCO‑20^i demonstrate that our framework achieves state‑of‑the‑art performance by a significant margin, demonstrating superior generalization to novel classes and robustness across diverse scenarios. The source code is available at https://github.com/aminpdik/DSV‑LFS

Abstract:
Recent real‑time semantic segmentation models, whether single‑branch or multi‑branch, achieve good performance and speed. However, their speed is limited by multi‑path blocks, and some depend on high‑performance teacher models for training. To overcome these issues, we propose Golden Cudgel Network (GCNet). Specifically, GCNet uses vertical multi‑convolutions and horizontal multi‑paths for training, which are reparameterized into a single convolution for inference, optimizing both performance and speed. This design allows GCNet to self‑enlarge during training and self‑contract during inference, effectively becoming a "teacher model" without needing external ones. Experimental results show that GCNet outperforms existing state‑of‑the‑art models in terms of performance and speed on the Cityscapes, CamVid, and Pascal VOC 2012 datasets. The code is available at https://github.com/gyyang23/GCNet.
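
The idea of folding several training-time paths into one inference-time convolution rests on the linearity of convolution. The sketch below shows this for a single channel without bias or normalization: a 3x3 path, a 1x1 path, and an identity path are summed during "training" and replaced by one merged 3x3 kernel for "inference". It is a generic reparameterization illustration under these simplifying assumptions and does not reproduce the actual GCNet block; all function names and values are hypothetical.

```python
def conv3x3(image, kernel):
    """Single-channel 3x3 cross-correlation with zero padding."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[dy + 1][dx + 1] * image[yy][xx]
            out[y][x] = acc
    return out


def pad_1x1_to_3x3(weight):
    """Express a 1x1 convolution (or identity, weight=1) as a 3x3 kernel."""
    return [[0.0, 0.0, 0.0], [0.0, weight, 0.0], [0.0, 0.0, 0.0]]


def merge_kernels(kernels):
    """Sum parallel 3x3 kernels element-wise into a single kernel."""
    return [[sum(k[i][j] for k in kernels) for j in range(3)] for i in range(3)]


image = [[1.0, 2.0, 0.0], [0.0, 3.0, 1.0], [2.0, 1.0, 4.0]]
k3 = [[1.0, 0.0, -1.0], [2.0, 0.0, -2.0], [1.0, 0.0, -1.0]]
k1 = pad_1x1_to_3x3(0.5)
identity = pad_1x1_to_3x3(1.0)

# Training-time view: three parallel paths whose outputs are summed.
paths = [conv3x3(image, k) for k in (k3, k1, identity)]
train_out = [[sum(p[y][x] for p in paths) for x in range(3)] for y in range(3)]

# Inference-time view: one convolution with the merged kernel.
infer_out = conv3x3(image, merge_kernels([k3, k1, identity]))

print(train_out == infer_out)
```

Because both views produce the same output, the extra paths add capacity during training without adding cost at inference.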

Abstract:
The perception capability of robotic systems relies on the richness of the dataset. Although Segment Anything Model 2 (SAM2), trained on large datasets, demonstrates strong potential in perception tasks, its inherent training paradigm prevents it from being suitable for RGB‑T tasks. To address these challenges, we propose SHIFNet, a novel SAM2‑driven Hybrid Interaction Paradigm that unlocks the potential of SAM2 with linguistic guidance for efficient RGB‑Thermal perception. Our framework consists of two key components: (1) a Semantic‑Aware Cross‑modal Fusion (SACF) module that dynamically balances modality contributions through text‑guided affinity learning, overcoming SAM2's inherent RGB bias; (2) a Heterogeneous Prompting Decoder (HPD) that enhances global semantic information through a semantic enhancement module and then combines it with category embeddings to amplify cross‑modal semantic consistency. With 32.27M trainable parameters, SHIFNet achieves state‑of‑the‑art segmentation performance on public benchmarks, reaching 89.8% on PST900 and 67.8% on FMB. The framework facilitates the adaptation of pre‑trained large models to RGB‑T segmentation tasks, effectively mitigating the high costs associated with data collection while endowing robotic systems with comprehensive perception capabilities. The source code will be made publicly available at https://github.com/iAsakiT3T/SHIFNet.

Abstract:
Bird's Eye View (BEV) perception technology is crucial for autonomous driving, as it generates top‑down 2D maps for environment perception, navigation, and decision‑making. Nevertheless, the majority of current BEV map generation studies focusing on visual map generation lack depth‑aware reasoning capabilities. They exhibit limited efficacy in managing occlusions and handling complex environments, with a notable decline in perceptual performance under adverse weather conditions or low‑light scenarios. Therefore, this paper proposes TS‑CGNet, which leverages Temporal‑Spatial fusion with Centerline‑Guided diffusion. This visual framework, grounded in prior knowledge, is designed for integration into any existing network for building BEV maps. Specifically, this framework is decoupled into three parts: the local mapping system performs the initial generation of semantic maps using purely visual information; the Temporal‑Spatial Aligner Module (TSAM) integrates historical information into map generation by applying transformation matrices; and the Centerline‑Guided Diffusion Model (CGDM) is a prediction module based on the diffusion model. CGDM incorporates centerline information through spatial‑attention mechanisms to enhance semantic segmentation reconstruction. We construct BEV semantic segmentation maps with our method on the public nuScenes dataset and on robustness benchmarks under various corruptions. Our method yields improvements of 1.90%, 1.73%, and 2.87% for perceived ranges of 60x30m, 120x60m, and 240x60m in the task of BEV HD mapping. TS‑CGNet attains an improvement of 1.92% for a perceived range of 100x100m in the task of BEV semantic mapping. Moreover, TS‑CGNet achieves an average improvement of 2.92% in detection accuracy under varying weather conditions and sensor interferences in the perception range of 240x60m. The source code will be publicly available at https://github.com/krabs‑H/TS‑CGNet.

Abstract:
Generalist models have achieved remarkable success in both language and vision‑language tasks, showcasing the potential of unified modeling. However, effectively integrating fine‑grained perception tasks like detection and segmentation into these models remains a significant challenge. This is primarily because these tasks often rely heavily on task‑specific designs and architectures that can complicate the modeling process. To address this challenge, we present UFO, a framework that Unifies Fine‑grained visual perception tasks through an Open‑ended language interface. By transforming all perception targets into the language space, UFO unifies object‑level detection, pixel‑level segmentation, and image‑level vision‑language tasks into a single model. Additionally, we introduce a novel embedding retrieval approach that relies solely on the language interface to support segmentation tasks. Our framework bridges the gap between fine‑grained perception and vision‑language tasks, significantly simplifying architectural design and training strategies while achieving comparable or superior performance to methods with intricate task‑specific designs. After multi‑task training on five standard visual perception datasets, UFO outperforms the previous state‑of‑the‑art generalist models by 12.3 mAP on COCO instance segmentation and 3.3 mIoU on ADE20K semantic segmentation. Furthermore, our method seamlessly integrates with existing MLLMs, effectively combining fine‑grained perception capabilities with their advanced language abilities, thereby enabling more challenging tasks such as reasoning segmentation. Code and models are available at https://github.com/nnnth/UFO.

Abstract:
Using quadrics as the object representation has the benefits of both generality and closed‑form projection derivation between image and world spaces. Although numerous constraints have been proposed for dual quadric reconstruction, we found that many of them are imprecise and provide minimal improvements to localization. After scrutinizing the existing constraints, we introduce a concise yet more precise convex hull‑based algebraic constraint for object landmarks, which is applied to object reconstruction, frontend pose estimation, and backend bundle adjustment. This constraint is designed to fully leverage precise semantic segmentation, effectively mitigating mismatches between complex‑shaped object contours and dual quadrics. Experiments on public datasets demonstrate that our approach is applicable to both monocular and RGB‑D SLAM and achieves better object mapping and localization than existing quadric SLAM methods. The implementation of our method is available at https://github.com/tiev‑tongji/convexhull‑based‑algebraic‑constraint.

Abstract:
Existing dataset pruning techniques primarily focus on classification tasks, limiting their applicability to more complex and practical tasks like instance segmentation. Instance segmentation presents three key challenges: pixel‑level annotations, instance area variations, and class imbalances, which significantly complicate dataset pruning efforts. Directly adapting existing classification‑based pruning methods proves ineffective due to their reliance on time‑consuming model training process. To address this, we propose a novel Training‑Free Dataset Pruning (TFDP) method for instance segmentation. Specifically, we leverage shape and class information from image annotations to design a Shape Complexity Score (SCS), refining it into a Scale‑Invariant (SI‑SCS) and Class‑Balanced (CB‑SCS) versions to address instance area variations and class imbalances, all without requiring model training. We achieve state‑of‑the‑art results on VOC 2012, Cityscapes, and COCO datasets, generalizing well across CNN and Transformer architectures. Remarkably, our approach accelerates the pruning process by an average of 1349× on COCO compared to the adapted baselines. Source code is available at: https://github.com/he‑y/dataset‑pruning‑for‑instance‑segmentation

Abstract:
LiDAR point cloud is essential for autonomous vehicles, but motion distortions from dynamic objects degrade the data quality. While previous work has considered distortions caused by ego motion, distortions caused by other moving objects remain largely overlooked, leading to errors in object shape and position. This distortion is particularly pronounced in high‑speed environments such as highways and in multi‑LiDAR configurations, a common setup for heavy vehicles. To address this challenge, we introduce HiMo, a pipeline that repurposes scene flow estimation for non‑ego motion compensation, correcting the representation of dynamic objects in point clouds. During the development of HiMo, we observed that existing self‑supervised scene flow estimators often produce degenerate or inconsistent estimates under high‑speed distortion. We further propose SeFlow++, a real‑time scene flow estimator that achieves state‑of‑the‑art performance on both scene flow and motion compensation. Since well‑established motion distortion metrics are absent in the literature, we introduce two evaluation metrics: compensation accuracy at a point level and shape similarity of objects. We validate HiMo through extensive experiments on Argoverse 2, ZOD, and a newly collected real‑world dataset featuring highway driving and multi‑LiDAR‑equipped heavy vehicles. Our findings show that HiMo improves the geometric consistency and visual fidelity of dynamic objects in LiDAR point clouds, benefiting downstream tasks such as semantic segmentation and 3D detection. See https://kin‑zhang.github.io/HiMo for more details.

Abstract:
Light field cameras capture multi‑view observations within a single exposure. However, existing studies are typically tailored to specific LF representations, leaving the field without a unified learning framework. To bridge this gap, we present LFX, the first unified framework for LF perception. LFX establishes a representation‑invariant feature modulation space, enabling it to adapt to heterogeneous LF representations and diverse perception tasks. Specifically, we propose Field‑of‑Parallax Angular Subspace Modeling (FoP‑ASM), which assigns an independent angular marker to each auxiliary view, enabling view‑wise independent modeling. Meanwhile, shared manifold subspace constraints and regularization losses enforce globally consistent semantic modulation across views. Extensive evaluations across three LF benchmarks show that LFX achieves state‑of‑the‑art results across distinct LF representations, outperforming representation‑specific methods by up to 12% and 20% with 0.029/0.027 MAE for salient object detection, and achieving 84.37 mIoU for semantic segmentation. The source code will be made publicly available at https://github.com/FeiT‑FeiTeng/LFX.

Abstract:
Top‑down attention plays a crucial role in the human vision system, wherein the brain initially obtains a rough overview of a scene to discover salient cues (i.e., overview first), followed by a more careful finer‑grained examination (i.e., look closely next). However, modern ConvNets remain confined to a pyramid structure that successively downsamples the feature map for receptive field expansion, neglecting this crucial biomimetic principle. We present OverLoCK, the first pure ConvNet backbone architecture that explicitly incorporates a top‑down attention mechanism. Unlike pyramid backbone networks, our design features a branched architecture with three synergistic sub‑networks: 1) a Base‑Net that encodes low/mid‑level features; 2) a lightweight Overview‑Net that generates dynamic top‑down attention through coarse global context modeling (i.e., overview first); and 3) a robust Focus‑Net that performs finer‑grained perception guided by top‑down attention (i.e., look closely next). To fully unleash the power of top‑down attention, we further propose a novel context‑mixing dynamic convolution (ContMix) that effectively models long‑range dependencies while preserving inherent local inductive biases even when the input resolution increases, addressing critical limitations in existing convolutions. Our OverLoCK exhibits a notable performance improvement over existing methods. For instance, OverLoCK‑T achieves a Top‑1 accuracy of 84.2%, significantly surpassing ConvNeXt‑B while using only around one‑third of the FLOPs/parameters. On object detection, our OverLoCK‑S clearly surpasses MogaNet‑B by 1% in AP^b. On semantic segmentation, our OverLoCK‑T remarkably improves UniRepLKNet‑T by 1.7% in mIoU. Code is publicly available at https://github.com/LMMMEng/OverLoCK.

Abstract:
While 3D instance segmentation (3DIS) has advanced significantly, most existing methods assume that all object classes are known in advance and uniformly distributed. However, this assumption is unrealistic in dynamic, real‑world environments where new classes emerge gradually and exhibit natural imbalance. Although some approaches address the emergence of new classes, they often overlook class imbalance, which leads to suboptimal performance, particularly on rare categories. To tackle this, we propose CLIMB‑3D, a unified framework for CLass‑incremental Imbalance‑aware 3DIS. Building upon established exemplar replay (ER) strategies, we show that ER alone is insufficient to achieve robust performance under memory constraints. To mitigate this, we introduce a novel pseudo‑label generator (PLG) that extends supervision to previously learned categories by leveraging predictions from a frozen model trained on prior tasks. Despite its promise, PLG tends to be biased towards frequent classes. Therefore, we propose a class‑balanced re‑weighting (CBR) scheme that estimates object frequencies from pseudo‑labels and dynamically adjusts training bias, without requiring access to past data. We design and evaluate three incremental scenarios for 3DIS on the challenging ScanNet200 dataset and additionally validate our method for semantic segmentation on ScanNetV2. Our approach achieves state‑of‑the‑art results, surpassing prior work by up to 16.76% mAP for instance segmentation and approximately 30% mIoU for semantic segmentation, demonstrating strong generalisation across both frequent and rare classes. Code is available at: https://github.com/vgthengane/CLIMB3D

Abstract:
Autonomous driving simulators provide an effective and low‑cost alternative for evaluating or enhancing visual perception models. However, the reliability of evaluation depends on the diversity and realism of the generated scenes. Extreme weather conditions, particularly extreme rainfall, are rare and costly to capture in real‑world settings. While simulated environments can help address this limitation, existing rainy image synthesizers often suffer from poor controllability over illumination and limited realism, which significantly undermines the effectiveness of the model evaluation. To that end, we propose a learning‑from‑rendering rainy image synthesizer, which combines the realism of rendering‑based methods with the controllability of learning‑based methods. To validate the effectiveness of our extreme rainy image synthesizer on the semantic segmentation task, we require a continuous set of well‑labeled extreme rainy images. By integrating the proposed synthesizer with the CARLA driving simulator, we develop CARLARain, an extreme rainy street scene simulator that can obtain paired rainy‑clean images and labels under complex illumination conditions. Qualitative and quantitative experiments validate that CARLARain can effectively improve the accuracy of semantic segmentation models in extreme rainy scenes, with the models' accuracy (mIoU) improved by 5% to 8% on the synthetic dataset and significantly enhanced in real extreme rainy scenarios under complex illuminations. Our source code and datasets are available at https://github.com/kb824999404/CARLARain/.

Abstract:
The feature maps of vision encoders are fundamental to myriad modern AI tasks, ranging from core perception algorithms (e.g. semantic segmentation, object detection, depth perception, etc.) to modern multimodal understanding in vision‑language models (VLMs). Currently, in computer vision, the frontier of general purpose vision backbones is Vision Transformers (ViT), typically trained using contrastive loss (e.g. CLIP). A key problem with most off‑the‑shelf ViTs, particularly CLIP, is that these models are inflexibly low resolution. Most run at 224 × 224px, while the "high‑resolution" versions are around 378‑448px, but still inflexible. We introduce a novel method to coherently and cheaply upsample the feature maps of low‑resolution vision encoders while picking up on fine‑grained details that would otherwise be lost due to resolution. We demonstrate the effectiveness of this approach on core perception tasks as well as within agglomerative model training using RADIO as a way of providing richer targets for distillation. Code available at https://github.com/NVlabs/FeatSharp .

Abstract:
Adversarial attacks pose a significant threat to deep learning models, particularly in safety‑critical applications like healthcare and autonomous driving. Recently, patch‑based attacks have demonstrated effectiveness in real‑time inference scenarios owing to their 'drag and drop' nature. Following this idea for Semantic Segmentation (SS), here we propose a novel Expectation Over Transformation (EOT) based adversarial patch attack that is more realistic for autonomous vehicles. To effectively train this attack, we also propose a 'simplified' loss function that is easy to analyze and implement. Using this attack as our basis, we investigate whether adversarial patches, once optimized on a specific SS model, can fool other models or architectures. We conduct a comprehensive cross‑model transferability analysis of adversarial patches trained on SOTA Convolutional Neural Network (CNN) models such as PIDNet‑S, PIDNet‑M and PIDNet‑L, among others. We also include the Segformer model to study transferability to Vision Transformers (ViTs). All of our analysis is conducted on the widely used Cityscapes dataset. Our study reveals key insights into how model architectures (CNN vs. CNN or CNN vs. Transformer‑based) influence attack susceptibility. In particular, we conclude that although the transferability (effectiveness) of attacks on unseen images of any dimension is very high, the attacks trained against one particular model are minimally effective on other models. This was found to be true for both ViT‑ and CNN‑based models. Additionally, our results indicate that for CNN‑based models, the repercussions of patch attacks are local, unlike ViTs. Per‑class analysis reveals that simple classes like 'sky' suffer less misclassification than others. The code for the project is available at: https://github.com/p‑shekhar/adversarial‑patch‑transferability

Abstract:
Weakly supervised semantic segmentation (WSSS) typically utilizes limited semantic annotations to obtain initial Class Activation Maps (CAMs). However, due to the inadequate coupling between class activation responses and semantic information in high‑dimensional space, the CAM is prone to object co‑occurrence or under‑activation, resulting in inferior recognition accuracy. To tackle this issue, we propose DOEI, Dual Optimization of Embedding Information, a novel approach that reconstructs embedding representations through semantic‑aware attention weight matrices to optimize the expression capability of embedding information. Specifically, DOEI amplifies tokens with high confidence and suppresses those with low confidence during the class‑to‑patch interaction. This alignment of activation responses with semantic information strengthens the propagation and decoupling of target features, enabling the generated embeddings to more accurately represent target features in high‑level semantic space. In addition, we propose a hybrid‑feature alignment module in DOEI that combines RGB values, embedding‑guided features, and self‑attention weights to increase the reliability of candidate tokens. Comprehensive experiments show that DOEI is an effective plug‑and‑play module that empowers state‑of‑the‑art visual transformer‑based WSSS models to significantly improve the quality of CAMs and segmentation performance on popular benchmarks, including PASCAL VOC (+3.6%, +1.5%, +1.2% mIoU) and MS COCO (+1.2%, +1.6% mIoU). Code will be available at https://github.com/AIGeeksGroup/DOEI.

Abstract:
In real‑world scenarios, environment changes caused by human or agent activities make it extremely challenging for robots to perform various long‑term tasks. Recent works typically struggle to effectively understand and adapt to dynamic environments due to the inability to update their environment representations in memory according to environment changes and lack of fine‑grained reconstruction of the environments. To address these challenges, we propose DynamicGSG, a dynamic, high‑fidelity, open‑vocabulary scene graph construction system leveraging Gaussian splatting. DynamicGSG builds hierarchical scene graphs using advanced vision language models to represent the spatial and semantic relationships between objects in the environments, utilizes a joint feature loss we designed to supervise Gaussian instance grouping while optimizing the Gaussian maps, and locally updates the Gaussian scene graphs according to real environment changes for long‑term environment adaptation. Experiments and ablation studies demonstrate the performance and efficacy of our proposed method in terms of semantic segmentation, language‑guided object retrieval, and reconstruction quality. Furthermore, we validate the dynamic updating capabilities of our system in real laboratory environments. The source code and supplementary experimental materials will be released at https://github.com/GeLuzhou/Dynamic‑GSG.

Abstract:
Semi‑supervised semantic segmentation (SSSS) aims to improve segmentation performance by utilizing large amounts of unlabeled data with limited labeled samples. Existing methods often suffer from coupling, where over‑reliance on initial labeled data leads to suboptimal learning; confirmation bias, where incorrect predictions reinforce themselves repeatedly; and boundary blur caused by limited boundary‑awareness and ambiguous edge cues. To address these issues, we propose CW‑BASS, a novel framework for SSSS. In order to mitigate the impact of incorrect predictions, we assign confidence weights to pseudo‑labels. Additionally, we leverage boundary‑delineation techniques, which, despite being extensively explored in weakly‑supervised semantic segmentation (WSSS), remain underutilized in SSSS. Specifically, our method: (1) reduces coupling via a confidence‑weighted loss that adjusts pseudo‑label influence based on their predicted confidence scores, (2) mitigates confirmation bias with a dynamic thresholding mechanism that learns to filter out pseudo‑labels based on model performance, (3) tackles boundary blur using a boundary‑aware module to refine segmentation near object edges, and (4) reduces label noise through a confidence decay strategy that progressively refines pseudo‑labels during training. Extensive experiments on Pascal VOC 2012 and Cityscapes demonstrate that CW‑BASS achieves state‑of‑the‑art performance. Notably, CW‑BASS achieves a 65.9% mIoU on Cityscapes under a challenging and underexplored 1/30 (3.3%) split (100 images), highlighting its effectiveness in limited‑label settings. Our code is available at https://github.com/psychofict/CW‑BASS.
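
The confidence-weighted pseudo-label idea in the abstract above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the CW-BASS implementation: the function name, the fixed `threshold`, and the normalisation by the sum of weights are all hypothetical, and the paper's dynamic thresholding, boundary-aware module, and confidence decay are not reproduced.

```python
import math

def confidence_weighted_loss(teacher_probs, student_probs, threshold=0.5):
    """Cross-entropy on pseudo-labels, weighted by teacher confidence.

    teacher_probs / student_probs: lists of per-pixel class-probability lists.
    The pseudo-label of a pixel is the teacher's argmax class and its weight
    is the teacher's probability for that class. Pixels whose confidence is
    below `threshold` are skipped. Hypothetical helper for illustration only.
    """
    total, weight_sum = 0.0, 0.0
    for t, s in zip(teacher_probs, student_probs):
        confidence = max(t)
        if confidence < threshold:
            continue  # drop unreliable pseudo-labels
        label = t.index(confidence)
        # Clamp to avoid log(0) when the student assigns zero probability.
        total += confidence * -math.log(max(s[label], 1e-12))
        weight_sum += confidence
    return total / weight_sum if weight_sum > 0 else 0.0
```

Weighting by confidence lets uncertain pseudo-labels contribute less to the gradient rather than being trusted equally with confident ones; the hard threshold additionally removes the least reliable pixels altogether.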

Abstract:
Camouflaged Object Segmentation (COS) remains challenging because camouflaged objects exhibit only subtle visual differences from their backgrounds and single‑modality RGB methods provide limited cues, leading researchers to explore multimodal data to improve segmentation accuracy. In this work, we present MultiCOS, a novel framework that effectively leverages diverse data modalities to improve segmentation performance. MultiCOS comprises two modules: Bi‑space Fusion Segmentor (BFSer), which employs a state space and a latent space fusion mechanism to integrate cross‑modal features within a shared representation and employs a fusion‑feedback mechanism to refine context‑specific features, and Cross‑modal Knowledge Learner (CKLer), which leverages external multimodal datasets to generate pseudo‑modal inputs and establish cross‑modal semantic associations, transferring knowledge to COS models when real multimodal pairs are missing. When real multimodal COS data are unavailable, CKLer yields additional segmentation gains using only non‑COS multimodal sources. Experiments on standard COS benchmarks show that BFSer outperforms existing multimodal baselines with both real and pseudo‑modal data. Code will be released at https://github.com/cnyvfang/MultiCOS.

Abstract:
Weeds are one of the major reasons for crop yield loss but current weeding practices fail to manage weeds in an efficient and targeted manner. Effective weed management is especially important for crops with high worldwide production such as maize, to maximize crop yield for meeting increasing global demands. Advances in near‑sensing and computer vision enable the development of new tools for weed management. Specifically, state‑of‑the‑art segmentation models, coupled with novel sensing technologies, can facilitate timely and accurate weeding and monitoring systems. However, learning‑based approaches require annotated data and show a lack of generalization to aerial imaging for different crops. We present a novel dataset for semantic and instance segmentation of crops and weeds in agricultural maize fields. The multispectral UAV‑based dataset contains images with RGB, red‑edge, and near‑infrared bands, a large number of plant instances, dense annotations for maize and four weed classes, and is multitemporal. We provide extensive baseline results for both tasks, including probabilistic methods to quantify prediction uncertainty, improve model calibration, and demonstrate the approach's applicability to out‑of‑distribution data. The results show the effectiveness of the two additional bands compared to RGB only, and better performance in our target domain than models trained on existing datasets. We hope our dataset advances research on methods and operational systems for fine‑grained weed identification, enhancing the robustness and applicability of UAV‑based weed management. The dataset and code are available at https://github.com/GFZ/weedsgalore

Abstract:
Moving object segmentation plays a crucial role in understanding dynamic scenes involving multiple moving objects, while the difficulties lie in taking into account both spatial texture structures and temporal motion cues. Existing methods based on video frames encounter difficulties in distinguishing whether pixel displacements of an object are caused by camera motion or object motion due to the complexities of accurate image‑based motion modeling. Recent advances exploit the motion sensitivity of novel event cameras to counter conventional images' inadequate motion modeling capabilities, but instead lead to challenges in segmenting pixel‑level object masks due to the lack of dense texture structures in events. To address these two limitations imposed by unimodal settings, we propose the first instance‑level moving object segmentation framework that integrates complementary texture and motion cues. Our model incorporates implicit cross‑modal masked attention augmentation, explicit contrastive feature learning, and flow‑guided motion enhancement to exploit dense texture information from a single image and rich motion information from events, respectively. By leveraging the augmented texture and motion features, we separate mask segmentation from motion classification to handle varying numbers of independently moving objects. Through extensive evaluations on multiple datasets, as well as ablation experiments with different input settings and real‑time efficiency analysis of the proposed framework, we believe that our first attempt to incorporate image and event data for practical deployment can provide new insights for future work in event‑based motion related works. The source code with model training and pre‑trained weights is released at https://npucvr.github.io/EvInsMOS

Abstract:
State space models (SSMs) have recently garnered significant attention in computer vision. However, due to the unique characteristics of image data, adapting SSMs from natural language processing to computer vision has not outperformed the state‑of‑the‑art convolutional neural networks (CNNs) and Vision Transformers (ViTs). Existing vision SSMs primarily leverage manually designed scans to flatten image patches into sequences locally or globally. This approach disrupts the original semantic spatial adjacency of the image and lacks flexibility, making it difficult to capture complex image structures. To address this limitation, we propose Dynamic Adaptive Scan (DAS), a data‑driven method that adaptively allocates scanning orders and regions. This enables more flexible modeling capabilities while maintaining linear computational complexity and global modeling capacity. Based on DAS, we further propose the vision backbone DAMamba, which significantly outperforms current state‑of‑the‑art vision Mamba models in vision tasks such as image classification, object detection, instance segmentation, and semantic segmentation. Notably, it surpasses some of the latest state‑of‑the‑art CNNs and ViTs. Code will be available at https://github.com/ltzovo/DAMamba.

Abstract:
Deep learning (DL) techniques have emerged as promising solutions for medical wound tissue segmentation. However, a notable limitation in this field is the lack of publicly available labelled datasets and a standardised performance evaluation of state‑of‑the‑art DL models on such datasets. This study addresses this gap by comprehensively evaluating various DL models for wound tissue segmentation using a novel dataset. We have curated a dataset comprising 147 wound images exhibiting six tissue types: slough, granulation, maceration, necrosis, bone, and tendon. The dataset was meticulously labelled for semantic segmentation employing supervised machine learning techniques. Three distinct labelling formats were developed ‑‑ full image, patch, and superpixel. Our investigation encompassed a wide array of DL segmentation and classification methodologies, ranging from conventional approaches like UNet, to generative adversarial networks such as cGAN, and modified techniques like FPN+VGG16. Also, we explored DL‑based classification methods (e.g., ResNet50) and machine learning‑based classification leveraging DL features (e.g., AlexNet+RF). In total, 82 wound tissue segmentation models were derived across the three labelling formats. Our analysis yielded several notable findings, including identifying optimal DL models for each labelling format based on weighted average Dice or F1 scores. Notably, FPN+VGG16 emerged as the top‑performing DL model for wound tissue segmentation, achieving a dice score of 82.25%. This study provides a valuable benchmark for evaluating wound image segmentation and classification models, offering insights to inform future research and clinical practice in wound care. The labelled dataset created in this study is available at https://github.com/akabircs/WoundTissue.
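
For readers unfamiliar with the Dice score reported above, the following sketch computes it for a single pair of binary masks. It assumes masks are given as nested lists of 0/1 values and defines the score as 1.0 when both masks are empty (a common convention, assumed here). The weighted averaging across tissue types mentioned in the abstract is not shown.

```python
def dice_score(pred, target):
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for two binary masks.

    pred / target: nested lists of 0/1 values with identical shape.
    Returns 1.0 when both masks are empty (convention assumed here).
    """
    intersection = sum(
        p & t
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    )
    size = sum(map(sum, pred)) + sum(map(sum, target))
    return 1.0 if size == 0 else 2.0 * intersection / size
```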

Abstract:
Nuclear instance segmentation has played a critical role in pathology image analysis. The main challenges arise from the difficulty in accurately segmenting instances and the high cost of precise mask‑level annotations for fully‑supervised training. In this work, we propose a Fourier guidance framework for solving the weakly‑supervised nuclear instance segmentation problem. In this framework, we construct a Fourier guidance module to fuse prior information into the training process of the model, which helps the model capture the relevant features of the nuclei. Meanwhile, in order to further improve the model's ability to represent the features of nuclei, we propose the guide‑based instance‑level contrastive module. This module makes full use of the framework's own properties and guide information to effectively enhance the feature representation of nuclei. We show on two public datasets that our model can outperform current SOTA methods under the fully‑supervised design, and in weakly‑supervised experiments, with only a small amount of labeling, our model still maintains performance close to that under full supervision. In addition, we also perform generalization experiments on a private dataset, and without any labeling, our model is able to segment nuclear images that have not been seen during training quite effectively. As open science, all codes and pre‑trained models are available at https://github.com/LQY404/FrGNet.

Abstract:
Existing methods derive clinical functional metrics from ventricular semantic segmentation in cardiac cine sequences. While performing well on overall segmentation, they struggle with the end slices. To address this, we extract global uncertainty from segmentation variance and use it in our ensemble learning method, Streaming, for classifier weighting, balancing overall and end‑slice performance. We introduce the End Coefficient (EC) to quantify end‑slice accuracy. Experiments on ACDC and M&Ms datasets show that our framework achieves near state‑of‑the‑art Dice Similarity Coefficient (DSC) and outperforms all models on end‑slice performance, improving patient‑specific segmentation accuracy. We open‑sourced our code at https://github.com/LEw1sin/Uncertainty‑Ensemble.

Abstract:
We introduce Knowledge Swapping, a novel task designed to selectively regulate knowledge of a pretrained model by enabling the forgetting of user‑specified information, retaining essential knowledge, and acquiring new knowledge simultaneously. By delving into the analysis of knock‑on feature hierarchy, we find that incremental learning typically progresses from low‑level representations to higher‑level semantics, whereas forgetting tends to occur in the opposite direction, starting from high‑level semantics and moving down to low‑level features. Building upon this, we propose to benchmark the knowledge swapping task with the strategy of Learning Before Forgetting. Comprehensive experiments on various tasks like image classification, object detection, and semantic segmentation validate the effectiveness of the proposed strategy. The source code is available at https://github.com/xingmingyu123456/KnowledgeSwapping.

Abstract:
In this paper, we explore a principled way to enhance the quality of widely pre‑existing coarse masks, enabling them to serve as reliable training data for segmentation models to reduce the annotation cost. In contrast to prior refinement techniques that are tailored to specific models or tasks in a closed‑world manner, we propose SAMRefiner, a universal and efficient approach that adapts SAM to the mask refinement task. The core technique of our model is the noise‑tolerant prompting scheme. Specifically, we introduce a multi‑prompt excavation strategy to mine diverse input prompts for SAM (i.e., distance‑guided points, context‑aware elastic bounding boxes, and Gaussian‑style masks) from initial coarse masks. These prompts can collaborate with each other to mitigate the effect of defects in coarse masks. In particular, considering the difficulty SAM has in handling the multi‑object case in semantic segmentation, we introduce a split‑then‑merge (STM) pipeline. Additionally, we extend our method to SAMRefiner++ by introducing an additional IoU adaption step to further boost the performance of the generic SAMRefiner on the target dataset. This step is self‑boosted and requires no additional annotation. The proposed framework is versatile and can flexibly cooperate with existing segmentation methods. We evaluate our mask framework on a wide range of benchmarks under different settings, demonstrating better accuracy and efficiency. SAMRefiner holds significant potential to expedite the evolution of refinement tools. Our code is available at https://github.com/linyq2117/SAMRefiner.
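
To make the idea of mining prompts from a coarse mask more concrete, here is a simplified sketch that derives one box prompt and one point prompt from a binary mask. It is not the SAMRefiner procedure: the fixed `margin` is a stand-in for the paper's context-aware elastic box, the point is simply the foreground pixel farthest from any background pixel (a brute-force reading of "distance-guided"), and the function name is hypothetical.

```python
def prompts_from_mask(mask, margin=1):
    """Derive a box prompt and a point prompt from a binary coarse mask.

    mask: nested list of 0/1 values (rows of pixels).
    Returns ((x0, y0, x1, y1), (px, py)) in pixel coordinates.
    Brute force, so only suitable for small masks; illustration only.
    """
    height, width = len(mask), len(mask[0])
    fg = [(x, y) for y in range(height) for x in range(width) if mask[y][x]]
    bg = [(x, y) for y in range(height) for x in range(width) if not mask[y][x]]
    if not fg:
        raise ValueError("mask has no foreground pixels")

    xs = [x for x, _ in fg]
    ys = [y for _, y in fg]
    # Tight bounding box, enlarged by a fixed margin and clipped to the image.
    box = (
        max(min(xs) - margin, 0),
        max(min(ys) - margin, 0),
        min(max(xs) + margin, width - 1),
        min(max(ys) + margin, height - 1),
    )

    def squared_dist_to_background(p):
        # The image border is not treated as background in this sketch.
        return min(
            ((p[0] - bx) ** 2 + (p[1] - by) ** 2 for bx, by in bg),
            default=float("inf"),
        )

    # The most "interior" foreground pixel is least likely to sit on a
    # noisy mask boundary.
    point = max(fg, key=squared_dist_to_background)
    return box, point
```

The resulting box and point could then be passed to a promptable segmenter; how the prompts are combined and refined in the actual method is described in the paper, not here.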

Abstract:
Few‑shot semantic segmentation (FSS) methods have shown great promise in handling data‑scarce scenarios, particularly in medical image segmentation tasks. However, most existing FSS architectures lack sufficient interpretability and fail to fully incorporate the underlying physical structures of semantic regions. To address these issues, in this paper, we propose a novel deep unfolding network, called the Learned Mumford‑Shah Network (LMS‑Net), for the FSS task. Specifically, motivated by the effectiveness of pixel‑to‑prototype comparison in prototypical FSS methods and the capability of deep priors to model complex spatial structures, we leverage our learned Mumford‑Shah model (LMS model) as a mathematical foundation to integrate these insights into a unified framework. By reformulating the LMS model into prototype update and mask update tasks, we propose an alternating optimization algorithm to solve it efficiently. Further, the iterative steps of this algorithm are unfolded into corresponding network modules, resulting in LMS‑Net with clear interpretability. Comprehensive experiments on three publicly available medical segmentation datasets verify the effectiveness of our method, demonstrating superior accuracy and robustness in handling complex structures and adapting to challenging segmentation scenarios. These results highlight the potential of LMS‑Net to advance FSS in medical imaging applications. Our code will be available at: https://github.com/SDZhang01/LMSNet

Abstract:
Service robots operating in unstructured environments must effectively recognize and segment unknown objects to enhance their functionality. Traditional supervised learning‑based segmentation techniques require extensive annotated datasets, which are impractical for the diversity of objects encountered in real‑world scenarios. Unseen Object Instance Segmentation (UOIS) methods aim to address this by training models on synthetic data to generalize to novel objects, but they often suffer from the simulation‑to‑reality gap. This paper proposes a novel approach (ZISVFM) for solving UOIS by leveraging the powerful zero‑shot capability of the segment anything model (SAM) and explicit visual representations from a self‑supervised vision transformer (ViT). The proposed framework operates in three stages: (1) generating object‑agnostic mask proposals from colorized depth images using SAM, (2) refining these proposals using attention‑based features from the self‑supervised ViT to filter non‑object masks, and (3) applying K‑Medoids clustering to generate point prompts that guide SAM towards precise object segmentation. Experimental validation on two benchmark datasets and a self‑collected dataset demonstrates the superior performance of ZISVFM in complex environments, including hierarchical settings such as cabinets, drawers, and handheld objects. Our source code is available at https://github.com/Yinmlmaoliang/zisvfm.
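
Stage (3) above relies on K-Medoids clustering to pick point prompts. The sketch below shows a plain alternating K-Medoids on 2D points, assuming for illustration that the clustered items are pixel coordinates inside a mask (what ZISVFM actually clusters is not specified in the abstract). The deterministic initialisation with the first `k` points and the function name are also assumptions.

```python
import math

def k_medoids(points, k, iters=10):
    """Alternating K-Medoids on 2D points; returns a list of k medoids.

    points: list of (x, y) tuples, with len(points) >= k.
    Initial medoids are the first k points (deterministic, for illustration).
    """
    medoids = list(points[:k])
    for _ in range(iters):
        # Assignment step: attach each point to its nearest medoid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, medoids[i]))
            clusters[nearest].append(p)
        # Update step: the new medoid is the cluster member with the
        # smallest total distance to the other members.
        new_medoids = [
            min(c, key=lambda m: sum(math.dist(m, q) for q in c)) if c else medoids[i]
            for i, c in enumerate(clusters)
        ]
        if new_medoids == medoids:
            break  # converged
        medoids = new_medoids
    return medoids
```

Unlike K-Means centroids, medoids are always actual input points, so a prompt derived this way is guaranteed to lie on a pixel that belongs to the mask, even for concave or ring-shaped regions.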

Abstract:
We tackle open‑vocabulary 3D scene understanding by introducing a novel data generation pipeline and training framework. Our method addresses three critical requirements for effective training: precise 3D region segmentation, comprehensive textual descriptions, and sufficient dataset scale. By leveraging state‑of‑the‑art open‑vocabulary image segmentation models and region‑aware Vision‑Language Models, we develop an automatic pipeline that generates high‑quality 3D mask‑text pairs. Applying this pipeline to multiple 3D scene datasets, we create Mosaic3D‑5.6M, a dataset of over 30K annotated scenes with 5.6M mask‑text pairs, significantly larger than existing datasets. Building upon this data, we propose Mosaic3D, a foundation model combining a 3D encoder trained with contrastive learning and a lightweight mask decoder for open‑vocabulary 3D semantic and instance segmentation. Our approach achieves state‑of‑the‑art results on open‑vocabulary 3D semantic and instance segmentation tasks including ScanNet200, Matterport3D, and ScanNet++, with ablation studies validating the effectiveness of our large‑scale training data.

Abstract:
Pre‑training techniques significantly enhance the performance of semantic segmentation tasks with limited training data. However, the efficacy under a large domain gap between pre‑training (e.g. RGB) and fine‑tuning (e.g. infrared) remains underexplored. In this study, we first benchmark the infrared semantic segmentation performance of various pre‑training methods and reveal several phenomena distinct from the RGB domain. Next, our layerwise analysis of pre‑trained attention maps uncovers that: (1) There are three typical attention patterns (local, hybrid, and global); (2) Pre‑training tasks notably influence the pattern distribution across layers; (3) The hybrid pattern is crucial for semantic segmentation as it attends to both nearby and foreground elements; (4) The texture bias impedes model generalization in infrared tasks. Building on these insights, we propose UNIP, a UNified Infrared Pre‑training framework, to enhance the pre‑trained model performance. This framework uses the hybrid‑attention distillation NMI‑HAD as the pre‑training target, a large‑scale mixed dataset InfMix for pre‑training, and a last‑layer feature pyramid network LL‑FPN for fine‑tuning. Experimental results show that UNIP outperforms various pre‑training methods by up to 13.5% in average mIoU on three infrared segmentation tasks, evaluated using fine‑tuning and linear probing metrics. UNIP‑S achieves performance on par with MAE‑L while requiring only 1/10 of the computational cost. Furthermore, UNIP significantly surpasses state‑of‑the‑art (SOTA) infrared or RGB segmentation methods and demonstrates broad potential for application in other modalities, such as RGB and depth. Our code is available at https://github.com/casiatao/UNIP.

Abstract:
Transferability, the ability of adversarial examples crafted for one model to deceive other models, is crucial for black‑box attacks. Despite advancements in attack methods for semantic segmentation, transferability remains limited, reducing their effectiveness in real‑world applications. To address this, we introduce the Feature Similarity Projected Gradient Descent (FSPGD) attack, a novel black‑box approach that enhances both attack performance and transferability. Unlike conventional segmentation attacks that rely on output predictions for gradient calculation, FSPGD computes gradients from intermediate layer features. Specifically, our method introduces a loss function that targets local information by comparing features between clean images and adversarial examples, while also disrupting contextual information by accounting for spatial relationships between objects. Experiments on Pascal VOC 2012 and Cityscapes datasets demonstrate that FSPGD achieves superior transferability and attack performance, establishing a new state‑of‑the‑art benchmark. Code is available at https://github.com/KU‑AIVS/FSPGD.

Abstract:
Industrial defect segmentation is critical for manufacturing quality control. Due to the scarcity of training defect samples, few‑shot semantic segmentation (FSS) holds significant value in this field. However, existing studies mostly apply FSS to tackle defects on simple textures, without considering more diverse scenarios. This paper aims to address this gap by exploring FSS in broader industrial products with various defect types. To this end, we contribute a new real‑world dataset and reorganize some existing datasets to build a more comprehensive few‑shot defect segmentation (FDS) benchmark. On this benchmark, we thoroughly investigate metric learning‑based FSS methods, including those based on meta‑learning and those based on Vision Foundation Models (VFMs). We observe that existing meta‑learning‑based methods are generally not well‑suited for this task, while VFMs hold great potential. We further systematically study the applicability of various VFMs in this task, involving two paradigms: feature matching and the use of Segment Anything (SAM) models. We propose a novel efficient FDS method based on feature matching. Meanwhile, we find that SAM2 is particularly effective for addressing FDS through its video track mode. The contributed dataset and code will be available at: https://github.com/liutongkun/GFDS.

Abstract:
Recent advancements in deep neural networks have significantly enhanced the performance of semantic segmentation. However, class imbalance and instance imbalance remain persistent challenges, where smaller instances and thin boundaries are often overshadowed by larger structures. To address the multiscale nature of segmented objects, various models have incorporated mechanisms such as spatial attention and feature pyramid networks. Despite these advancements, most loss functions are still primarily pixel‑wise, while regional and boundary‑focused loss functions often incur high computational costs or are restricted to small‑scale regions. To address this limitation, we propose the complex wavelet mutual information (CWMI) loss, a novel loss function that leverages mutual information from subband images decomposed by a complex steerable pyramid. The complex steerable pyramid captures features across multiple orientations and preserves structural similarity across scales. Meanwhile, mutual information is well‑suited to capturing high‑dimensional directional features and offers greater noise robustness. Extensive experiments on diverse segmentation datasets demonstrate that CWMI loss achieves significant improvements in both pixel‑wise accuracy and topological metrics compared to state‑of‑the‑art methods, while introducing minimal computational overhead. Our code is available at https://github.com/lurenhaothu/CWMI

Abstract:
Nucleus segmentation is an important analysis task in digital pathology. However, methods for automatic segmentation often struggle with new data from a different distribution, requiring users to manually annotate nuclei and retrain data‑specific models. Vision foundation models (VFMs), such as the Segment Anything Model (SAM), offer a more robust alternative for automatic and interactive segmentation. Despite their success in natural images, a foundation model for nucleus segmentation in histopathology is still missing. Initial efforts to adapt SAM have shown some success, but did not yet introduce a comprehensive model for diverse segmentation tasks. To close this gap, we introduce PathoSAM, a VFM for nucleus segmentation, based on training SAM on a diverse dataset. Our extensive experiments show that it is the new state‑of‑the‑art model for automatic and interactive nucleus instance segmentation in histopathology. We also demonstrate how it can be adapted for other segmentation tasks, including semantic nucleus segmentation. For this task, we show that it yields results better than popular methods, while not yet beating the state‑of‑the‑art, CellViT. Our models are open‑source and compatible with popular tools for data annotation. We also provide scripts for whole‑slide image segmentation. Our code and models are publicly available at https://github.com/computational‑cell‑analytics/patho‑sam.

Abstract:
The Cerrado faces increasing environmental pressures, necessitating accurate land use and land cover (LULC) mapping despite challenges such as class imbalance and visually similar categories. To address this, we present CerraData‑4MM, a multimodal dataset combining Sentinel‑1 Synthetic Aperture Radar (SAR) and Sentinel‑2 MultiSpectral Imagery (MSI) with 10m spatial resolution. The dataset includes two hierarchical classification levels with 7 and 14 classes, respectively, focusing on the diverse Bico do Papagaio ecoregion. We highlight CerraData‑4MM's capacity to benchmark advanced semantic segmentation techniques by evaluating a standard U‑Net and a more sophisticated Vision Transformer (ViT) model. The ViT achieves superior performance in multimodal scenarios, with the highest macro F1‑score of 57.60% and a mean Intersection over Union (mIoU) of 49.05% at the first hierarchical level. Both models struggle with minority classes, particularly at the second hierarchical level, where U‑Net's performance drops to an F1‑score of 18.16%. Class balancing improves representation for underrepresented classes but reduces overall accuracy, underscoring the trade‑off in weighted training. CerraData‑4MM offers a challenging benchmark for advancing deep learning models to handle class imbalance and multimodal data fusion. Code, trained models, and data are publicly available at https://github.com/ai4luc/CerraData‑4MM.
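
The two headline metrics in this abstract, macro F1‑score and mIoU, can both be derived from a per‑class confusion matrix. The sketch below is illustrative only: the 3‑class matrix is made‑up data, not CerraData‑4MM results, and scoring an absent class as 0.0 is an assumption (some toolkits skip such classes instead).

```python
def per_class_scores(conf):
    """Return (IoU list, F1 list) from a square confusion matrix.

    conf[i][j] is the number of pixels whose true class is i and
    whose predicted class is j.
    """
    n = len(conf)
    ious, f1s = [], []
    for c in range(n):
        tp = conf[c][c]
        fn = sum(conf[c]) - tp                       # true c, predicted other
        fp = sum(conf[r][c] for r in range(n)) - tp  # predicted c, true other
        union = tp + fp + fn
        ious.append(tp / union if union else 0.0)    # assumption: absent class scores 0
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return ious, f1s


def mean(values):
    return sum(values) / len(values)


# Hypothetical confusion matrix with one minority class (class 2).
conf = [
    [50, 10, 0],
    [5, 30, 5],
    [0, 5, 5],
]
ious, f1s = per_class_scores(conf)
miou = mean(ious)       # unweighted mean over classes
macro_f1 = mean(f1s)    # unweighted mean over classes
print(f"mIoU={miou:.3f} macro-F1={macro_f1:.3f}")
```

Because both metrics average over classes without weighting by pixel count, a poorly segmented minority class lowers them as much as a majority class would, which is why class imbalance shows up so strongly in these scores.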

Abstract:
Semantic segmentation on LiDAR imaging is increasingly gaining attention, as it can provide useful knowledge for perception systems and potential for autonomous driving. However, collecting and labeling real LiDAR data is an expensive and time‑consuming task. While datasets such as SemanticKITTI have been manually collected and labeled, the introduction of simulation tools such as CARLA has enabled the creation of synthetic datasets on demand. In this work, we present a modified CARLA simulator designed with LiDAR semantic segmentation in mind, with new classes, object labeling that is more consistent with its counterparts in real datasets such as SemanticKITTI, and the possibility to adjust the object class distribution. Using this tool, we have generated SynthmanticLiDAR, a synthetic dataset for semantic segmentation on LiDAR imaging, designed to be similar to SemanticKITTI, and we evaluate its contribution to the training process of different semantic segmentation algorithms by using a naive transfer learning approach. Our results show that incorporating SynthmanticLiDAR into the training process improves the overall performance of tested algorithms, proving the usefulness of our dataset, and therefore, our adapted CARLA simulator. The dataset and simulator are available at https://github.com/vpulab/SynthmanticLiDAR.

Abstract:
In real‑world scenarios, achieving domain adaptation and generalization poses significant challenges, as models must adapt to or generalize across unknown target distributions. Extending these capabilities to unseen multimodal distributions, i.e., multimodal domain adaptation and generalization, is even more challenging due to the distinct characteristics of different modalities. Significant progress has been made over the years, with applications ranging from action recognition to semantic segmentation. Besides, the recent advent of large‑scale pre‑trained multimodal foundation models, such as CLIP, has inspired works leveraging these models to enhance adaptation and generalization performances or adapting them to downstream tasks. This survey provides the first comprehensive review of recent advances from traditional approaches to foundation models, covering: (1) Multimodal domain adaptation; (2) Multimodal test‑time adaptation; (3) Multimodal domain generalization; (4) Domain adaptation and generalization with the help of multimodal foundation models; and (5) Adaptation of multimodal foundation models. For each topic, we formally define the problem and thoroughly review existing methods. Additionally, we analyze relevant datasets and applications, highlighting open challenges and potential future research directions. We maintain an active repository that contains up‑to‑date literature at https://github.com/donghao51/Awesome‑Multimodal‑Adaptation.

Abstract:
We present a new family of mobile hybrid vision networks, called iFormer, with a focus on optimizing latency and accuracy on mobile applications. iFormer effectively integrates the fast local representation capacity of convolution with the efficient global modeling ability of self‑attention. The local interactions are derived from transforming a standard convolutional network, i.e., ConvNeXt, to design a more lightweight mobile network. Our newly introduced mobile modulation attention removes memory‑intensive operations in MHA and employs an efficient modulation mechanism to boost dynamic global representational capacity. We conduct comprehensive experiments demonstrating that iFormer outperforms existing lightweight networks across various tasks. Notably, iFormer achieves an impressive Top‑1 accuracy of 80.4% on ImageNet‑1k with a latency of only 1.10 ms on an iPhone 13, surpassing the recently proposed MobileNetV4 under similar latency constraints. Additionally, our method shows significant improvements in downstream tasks, including COCO object detection, instance segmentation, and ADE20k semantic segmentation, while still maintaining low latency on mobile devices for high‑resolution inputs in these scenarios.

Abstract:
Training semantic segmenters with synthetic data has attracted great attention because such data is easy to obtain in large quantities. Most previous methods focus on producing large‑scale synthetic image‑annotation samples and then training the segmenter on all of them. However, poor‑quality samples are unavoidable in such a pipeline, and training on them damages the training process. In this paper, we propose a training‑free Synthetic Data Selection (SDS) strategy with CLIP to select high‑quality samples for building a reliable synthetic dataset. Specifically, given massive synthetic image‑annotation pairs, we first design a Perturbation‑based CLIP Similarity (PCS) to measure the reliability of each synthetic image, thus removing samples with low‑quality images. Then we propose a class‑balance Annotation Similarity Filter (ASF) that compares the synthetic annotation with the response of CLIP to remove samples with low‑quality annotations. The experimental results show that our method reduces the data size by half, while the trained segmenter achieves higher performance. The code is released at https://github.com/tanghao2000/SDS.

Abstract:
Referring video object segmentation (RVOS) aims to segment target objects throughout a video based on a text description. This is challenging as it involves deep vision‑language understanding, pixel‑level dense prediction and spatiotemporal reasoning. Despite notable progress in recent years, existing methods still exhibit a noticeable gap when considering all these aspects. In this work, we propose ReferDINO, a strong RVOS model that inherits region‑level vision‑language alignment from foundational visual grounding models, and is further endowed with pixel‑level dense perception and cross‑modal spatiotemporal reasoning. In detail, ReferDINO integrates two key components: 1) a grounding‑guided deformable mask decoder that utilizes location prediction to progressively guide mask prediction through differentiable deformation mechanisms; 2) an object‑consistent temporal enhancer that injects pretrained time‑varying text features into inter‑frame interaction to capture object‑aware dynamic changes. Moreover, a confidence‑aware query pruning strategy is designed to accelerate object decoding without compromising model performance. Extensive experimental results on five benchmarks demonstrate that our ReferDINO significantly outperforms previous methods (e.g., +3.9% ($\mathcal{J}\&\mathcal{F}$) on Ref‑YouTube‑VOS) with real‑time inference speed (51 FPS).

Abstract:
The point clouds collected by the Airborne Laser Scanning (ALS) system provide accurate 3D information of urban land covers. By utilizing multi‑temporal ALS point clouds, semantic changes in urban areas can be captured, demonstrating significant potential in urban planning, emergency management, and infrastructure maintenance. Existing 3D change detection methods struggle to efficiently extract multi‑class semantic information and change features, and still face the following challenges: (1) the difficulty of accurately modeling spatial relationships between cross‑temporal point clouds for effective change feature extraction; (2) class imbalance of change samples, which hinders the distinguishability of semantic features; (3) the lack of real‑world datasets for 3D semantic change detection. To resolve these challenges, we propose the Multi‑task Enhanced Cross‑temporal Point Transformer (ME‑CPT) network. ME‑CPT establishes spatiotemporal correspondences between point clouds across different epochs and employs attention mechanisms to jointly extract semantic change features, facilitating information exchange and change comparison. Additionally, we incorporate a semantic segmentation task and, through the multi‑task training strategy, further enhance the distinguishability of semantic features, reducing the impact of class imbalance in change types. Moreover, we release a 22.5 km^2 3D semantic change detection dataset, offering diverse scenes for comprehensive evaluation. Experiments on multiple datasets show that the proposed ME‑CPT achieves superior performance compared to existing state‑of‑the‑art methods. The source code and dataset will be released upon acceptance at https://github.com/zhangluqi0209/ME‑CPT.

Abstract:
With the rapid development of diffusion models, text‑to‑image (T2I) models have made significant progress, showcasing impressive abilities in prompt following and image generation. Recently launched models such as FLUX.1 and Ideogram2.0, along with others like Dall‑E3 and Stable Diffusion 3, have demonstrated exceptional performance across various complex tasks, raising questions about whether T2I models are moving towards general‑purpose applicability. Beyond traditional image generation, these models exhibit capabilities across a range of fields, including controllable generation, image editing, video, audio, 3D, and motion generation, as well as computer vision tasks like semantic segmentation and depth estimation. However, current evaluation frameworks are insufficient to comprehensively assess these models' performance across expanding domains. To thoroughly evaluate these models, we developed IMAGINE‑E and tested six prominent models: FLUX.1, Ideogram2.0, Midjourney, Dall‑E3, Stable Diffusion 3, and Jimeng. Our evaluation is divided into five key domains: structured output generation, realism and physical consistency, specific domain generation, challenging scenario generation, and multi‑style creation tasks. This comprehensive assessment highlights each model's strengths and limitations, particularly the outstanding performance of FLUX.1 and Ideogram2.0 in structured and specific domain tasks, underscoring the expanding applications and potential of T2I models as foundational AI tools. This study provides valuable insights into the current state and future trajectory of T2I models as they evolve towards general‑purpose usability. Evaluation scripts will be released at https://github.com/jylei16/Imagine‑e.

Abstract:
Referring video object segmentation (RVOS) aims to segment objects in a video according to textual descriptions, which requires the integration of multimodal information and temporal dynamics perception. The Segment Anything Model 2 (SAM 2) has shown great effectiveness across various video segmentation tasks. However, its application to offline RVOS is challenged by the translation of the text into effective prompts and a lack of global context awareness. In this paper, we propose a novel RVOS framework, termed MPG‑SAM 2, to address these challenges. Specifically, MPG‑SAM 2 employs a unified multimodal encoder to jointly encode video and textual features, generating semantically aligned video and text embeddings, along with multimodal class tokens. A mask prior generator utilizes the video embeddings and class tokens to create pseudo masks of target objects and global context. These masks are fed into the prompt encoder as dense prompts along with multimodal class tokens as sparse prompts to generate accurate prompts for SAM 2. To provide the online SAM 2 with a global view, we introduce a hierarchical global‑historical aggregator, which allows SAM 2 to aggregate global and historical information of target objects at both pixel and object levels, enhancing the target representation and temporal consistency. Extensive experiments on several RVOS benchmarks demonstrate the superiority of MPG‑SAM 2 and the effectiveness of our proposed modules. The code is available at https://github.com/rongfu‑dsb/MPG‑SAM2.

Abstract:
Data augmentation is a widely used and effective technique to improve the generalization performance of deep neural networks. Yet, despite often facing limited data availability when working with medical images, it is frequently underutilized. This appears to come from a gap in our collective understanding of the efficacy of different augmentation techniques across different tasks and modalities. One modality where this is especially true is ultrasound imaging. This work addresses this gap by analyzing the effectiveness of different augmentation techniques at improving model performance across a wide range of ultrasound image analysis tasks. To achieve this, we introduce a new standardized benchmark of 14 ultrasound image classification and semantic segmentation tasks from 10 different sources and covering 11 body regions. Our results demonstrate that many of the augmentations commonly used for tasks on natural images are also effective on ultrasound images, even more so than augmentations developed specifically for ultrasound images in some cases. We also show that diverse augmentation using TrivialAugment, which is widely used for natural images, is also effective for ultrasound images. Moreover, our proposed methodology represents a structured approach for assessing various data augmentations that can be applied to other contexts and modalities.

Abstract:
Medical image segmentation is an important analysis task in clinical practice and research. Deep learning has massively advanced the field, but current approaches are mostly based on models trained for a specific task. Training such models or adapting them to a new condition is costly due to the need for (manually) labeled data. The emergence of vision foundation models, especially Segment Anything, offers a path to universal segmentation for medical images, overcoming these issues. Here, we study how to improve Segment Anything for medical images by comparing different finetuning strategies on a large and diverse dataset. We evaluate the finetuned models on a wide range of interactive and (automatic) semantic segmentation tasks. We find that the performance can be clearly improved for interactive segmentation. However, semantic segmentation does not benefit from pretraining on medical images. Our best model, MedicoSAM, is publicly available at https://github.com/computational‑cell‑analytics/medico‑sam. We show that it is compatible with existing tools for data annotation and believe that it will be of great practical value.

Abstract:
Monte‑Carlo (MC) Dropout provides a practical solution for estimating predictive distributions in deterministic neural networks. Traditional dropout, applied within the signal space, may fail to account for frequency‑related noise common in medical imaging, leading to biased predictive estimates. A novel approach extends Dropout to the frequency domain, allowing stochastic attenuation of signal frequencies during inference. This creates diverse global textural variations in feature maps while preserving structural integrity, a factor that we hypothesize and empirically show contributes to accurate uncertainty estimation in semantic segmentation. We evaluated traditional MC‑Dropout and MC‑Frequency Dropout in three segmentation tasks involving different imaging modalities: (i) prostate zones in biparametric MRI, (ii) liver tumors in contrast‑enhanced CT, and (iii) lungs in chest X‑ray scans. Our results show that MC‑Frequency Dropout improves calibration, convergence, and semantic uncertainty, thereby improving prediction scrutiny and boundary delineation, and has the potential to enhance medical decision‑making.
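
The idea of dropping frequencies rather than signal values can be sketched on a toy 1D signal. This is a simplified illustration under several assumptions that the abstract does not state: a binary keep/drop mask (the abstract speaks of stochastic attenuation), a naive DFT, the DC component always kept, and the perturbed signal used directly where the method would pass perturbed feature maps through a network. The function names are hypothetical.

```python
import cmath
import random


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def frequency_dropout(signal, drop_prob, rng):
    """Zero random frequency bins and return the real-valued result.

    Bins k and n-k are dropped together so the output stays real.
    Assumption: bin 0 (the mean) is never dropped.
    """
    spectrum = dft(signal)
    n = len(spectrum)
    keep = [1.0] * n
    for k in range(1, n // 2 + 1):
        if rng.random() < drop_prob:
            keep[k] = 0.0
            keep[-k] = 0.0  # mirrored bin n-k
    return [v.real for v in idft([s * m for s, m in zip(spectrum, keep)])]


def mc_estimate(signal, drop_prob, n_samples, seed=0):
    """Per-position mean and variance over repeated stochastic passes."""
    rng = random.Random(seed)
    samples = [frequency_dropout(signal, drop_prob, rng) for _ in range(n_samples)]
    columns = list(zip(*samples))
    means = [sum(col) / n_samples for col in columns]
    variances = [sum((v - m) ** 2 for v in col) / n_samples
                 for col, m in zip(columns, means)]
    return means, variances


signal = [0.0, 1.0, 0.0, -1.0, 0.5, 2.0, 0.5, -2.0]
means, variances = mc_estimate(signal, drop_prob=0.3, n_samples=20)
```

The per‑position variance across passes plays the role of the uncertainty estimate; with `drop_prob=0` every pass reproduces the input and the variance collapses to zero.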

Abstract:
This paper proposes a novel approach to few‑shot semantic segmentation for machinery with multiple parts that exhibit spatial and hierarchical relationships. Our method integrates the foundation models CLIPSeg and Segment Anything Model (SAM) with the interest point detector SuperPoint and a graph convolutional network (GCN) to accurately segment machinery parts. By providing 1 to 25 annotated samples, our model, evaluated on a purely synthetic dataset depicting a truck‑mounted loading crane, achieves effective segmentation across various levels of detail. Training times are kept under five minutes on consumer GPUs. The model demonstrates robust generalization to real data, achieving a qualitative synthetic‑to‑real generalization with a J\&F score of 92.2 on real data using 10 synthetic support samples. When benchmarked on the DAVIS 2017 dataset, it achieves a J\&F score of 71.5 in semi‑supervised video segmentation with three support samples. This method's fast training times and effective generalization to real data make it a valuable tool for autonomous systems interacting with machinery and infrastructure, and illustrate the potential of combined and orchestrated foundation models for few‑shot segmentation tasks.

Abstract:
Remote sensing image change captioning (RSICC) aims to describe changes between bitemporal images in natural language. Existing methods often fail under challenges such as illumination differences, viewpoint changes, and blur effects, leading to inaccuracies, especially in no‑change regions. Moreover, images that are acquired at different spatial resolutions or that contain registration errors tend to degrade the captions. To address these issues, we introduce SECOND‑CC, a novel RSICC dataset featuring high‑resolution RGB image pairs, semantic segmentation maps, and diverse real‑world scenarios. SECOND‑CC contains 6,041 pairs of bitemporal RS images and 30,205 sentences describing the differences between images. Additionally, we propose MModalCC, a multimodal framework that integrates semantic and visual data using advanced attention mechanisms, including Cross‑Modal Cross Attention (CMCA) and Multimodal Gated Cross Attention (MGCA). Detailed ablation studies and attention visualizations further demonstrate its effectiveness and ability to address RSICC challenges. Comprehensive experiments show that MModalCC outperforms state‑of‑the‑art RSICC methods, including RSICCformer, Chg2Cap, and PSNet, with a +4.6% improvement in BLEU4 score and a +9.6% improvement in CIDEr score. We will make our dataset and codebase publicly available to facilitate future research at https://github.com/ChangeCapsInRS/SecondCC

Abstract:
Current image tokenization methods require a large number of tokens to capture the information contained within images. Although the amount of information varies across images, most image tokenizers only support fixed‑length tokenization, leading to inefficiency in token allocation. In this study, we introduce One‑D‑Piece, a discrete image tokenizer designed for variable‑length tokenization, which makes the reconstruction quality controllable. To enable a variable compression rate, we introduce a simple but effective regularization mechanism named "Tail Token Drop" into discrete one‑dimensional image tokenizers. This method encourages critical information to concentrate at the head of the token sequence, enabling variable‑length tokenization while preserving state‑of‑the‑art reconstruction quality. We evaluate our tokenizer across multiple reconstruction quality metrics and find that it delivers significantly better perceptual quality than existing quality‑controllable compression methods, including JPEG and WebP, at smaller byte sizes. Furthermore, we assess our tokenizer on various downstream computer vision tasks, including image classification, object detection, semantic segmentation, and depth estimation, confirming its adaptability to numerous applications compared to other variable‑rate methods. Our approach demonstrates the versatility of variable‑length discrete image tokenization, establishing a new paradigm in both compression efficiency and reconstruction performance. Finally, we validate the effectiveness of Tail Token Drop via detailed analysis of tokenizers.
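
A minimal sketch of how a tail‑dropping regularizer of this kind could work, assuming that during training a random number of trailing tokens is discarded before decoding, and that at inference the sequence is cut to a chosen budget. The uniform sampling of the kept length and both function names are assumptions for illustration, not details given in the abstract.

```python
import random


def tail_token_drop(tokens, rng, min_keep=1):
    """Training-time step: keep a random-length prefix of the token sequence.

    Assumption: the kept length is sampled uniformly from [min_keep, len(tokens)].
    """
    keep = rng.randint(min_keep, len(tokens))  # randint is inclusive on both ends
    return tokens[:keep]


def truncate_to_budget(tokens, budget):
    """Inference-time step: keep only the first `budget` tokens."""
    return tokens[:max(0, budget)]


rng = random.Random(0)
tokens = [17, 4, 256, 9, 81, 3, 140, 62]   # hypothetical discrete token ids
train_view = tail_token_drop(tokens, rng)  # what a decoder would see in one step
short_code = truncate_to_budget(tokens, 4)
```

Since the decoder only ever loses tokens from the tail, the earliest positions are present in every training step, which is what pushes the most important information toward the head of the sequence.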

Abstract:
Self‑supervised Object Segmentation (SOS) aims to segment objects without any annotations. Under conditions of multi‑camera inputs, the structural, textural and geometrical consistency among each view can be leveraged to achieve fine‑grained object segmentation. To make better use of the above information, we propose Surface representation based Self‑supervised Object Segmentation (Surface‑SOS), a new framework to segment objects for each view by 3D surface representation from multi‑view images of a scene. To model high‑quality geometry surfaces for complex scenes, we design a novel scene representation scheme, which decomposes the scene into two complementary neural representation modules respectively with a Signed Distance Function (SDF). Moreover, Surface‑SOS is able to refine single‑view segmentation with multi‑view unlabeled images, by introducing coarse segmentation masks as additional input. To the best of our knowledge, Surface‑SOS is the first self‑supervised approach that leverages neural surface representation to break the dependence on large amounts of annotated data and strong constraints. These constraints typically involve observing target objects against a static background or relying on temporal supervision in videos. Extensive experiments on standard benchmarks including LLFF, CO3D, BlendedMVS, TUM and several real‑world scenes show that Surface‑SOS always yields finer object masks than its NeRF‑based counterparts and surpasses supervised single‑view baselines remarkably. Code is available at: https://github.com/zhengxyun/Surface‑SOS.

Abstract:
Foundation models have revolutionized computer vision by achieving vastly superior performance across diverse tasks through large‑scale pretraining on extensive datasets. However, their application in surgical computer vision has been limited. This study addresses this gap by introducing SurgeNetXL, a novel surgical foundation model that sets a new benchmark in surgical computer vision. Trained on the largest reported surgical dataset to date, comprising over 4.7 million video frames, SurgeNetXL achieves consistent top‑tier performance across six datasets spanning four surgical procedures and three tasks, including semantic segmentation, phase recognition, and critical view of safety (CVS) classification. Compared with the best‑performing surgical foundation models, SurgeNetXL shows mean improvements of 2.4, 9.0, and 12.6 percent for semantic segmentation, phase recognition, and CVS classification, respectively. Additionally, SurgeNetXL outperforms the best‑performing ImageNet‑based variants by 14.4, 4.0, and 1.6 percent in the respective tasks. In addition to advancing model performance, this study provides key insights into scaling pretraining datasets, extending training durations, and optimizing model architectures specifically for surgical computer vision. These findings pave the way for improved generalizability and robustness in data‑scarce scenarios, offering a comprehensive framework for future research in this domain. All models and a subset of the SurgeNetXL dataset, including over 2 million video frames, are publicly available at: https://github.com/TimJaspers0801/SurgeNet.

Abstract:
Semantic segmentation is essential for comprehending images, but the process necessitates a substantial amount of detailed annotations at the pixel level. Acquiring such annotations can be costly in the real‑world. Unsupervised domain adaptation (UDA) for semantic segmentation is a technique that uses virtual data with labels to train a model and adapts it to real data without labels. Some recent works use contrastive learning, which is a powerful method for self‑supervised learning, to help with this technique. However, these works do not take into account the diversity of features within each class when using contrastive learning, which leads to errors in class prediction. We analyze the limitations of these works and propose a novel framework called Pseudo‑label Guided Pixel Contrast (PGPC), which overcomes the disadvantages of previous methods. We also investigate how to use more information from target images without adding noise from pseudo‑labels. We test our method on two standard UDA benchmarks and show that it outperforms existing methods. Specifically, we achieve relative improvements of 5.1% mIoU and 4.6% mIoU on the Grand Theft Auto V (GTA5) to Cityscapes and SYNTHIA to Cityscapes tasks based on DAFormer, respectively. Furthermore, our approach can enhance the performance of other UDA approaches without increasing model complexity. Code is available at https://github.com/embar111/pgpc

Abstract:
Object removal has so far been dominated by the mask‑and‑inpaint paradigm, where the masked region is excluded from the input, leaving models relying on unmasked areas to inpaint the missing region. However, this approach lacks contextual information for the masked area, often resulting in unstable performance. In this work, we introduce SmartEraser, built with a new removing paradigm called Masked‑Region Guidance. This paradigm retains the masked region in the input, using it as guidance for the removal process. It offers several distinct advantages: (a) it guides the model to accurately identify the object to be removed, preventing its regeneration in the output; (b) since the user mask often extends beyond the object itself, it aids in preserving the surrounding context in the final result. Leveraging this new paradigm, we present Syn4Removal, a large‑scale object removal dataset, where instance segmentation data is used to copy and paste objects onto images as removal targets, with the original images serving as ground truths. Experimental results demonstrate that SmartEraser significantly outperforms existing methods, achieving superior performance in object removal, especially in complex scenes with intricate compositions.

Abstract:
In this paper, we address the challenges in unsupervised video object segmentation (UVOS) by proposing an efficient algorithm, termed MTNet, which concurrently exploits motion and temporal cues. Unlike previous methods that focus solely on integrating appearance with motion or on modeling temporal relations, our method combines both aspects by integrating them within a unified framework. MTNet is devised by effectively merging appearance and motion features during the feature extraction process within encoders, promoting a more complementary representation. To capture the intricate long‑range contextual dynamics and information embedded within videos, a temporal transformer module is introduced, facilitating efficacious inter‑frame interactions throughout a video clip. Furthermore, we employ a cascade of decoders across all feature levels to optimally exploit the derived features, aiming to generate increasingly precise segmentation masks. As a result, MTNet provides a strong and compact framework that explores both temporal and cross‑modality knowledge to localize and track the primary object robustly and efficiently in various challenging scenarios. Extensive experiments across diverse benchmarks conclusively show that our method not only attains state‑of‑the‑art performance in unsupervised video object segmentation but also delivers competitive results in video salient object detection. These findings highlight the method's robust versatility and its adeptness in adapting to a range of segmentation tasks. Source code is available at https://github.com/hy0523/MTNet.

Abstract:
Timber represents an increasingly valuable and versatile resource. However, forestry operations such as harvesting, handling and measuring logs still require substantial human labor in remote environments, posing significant safety risks. Progressively automating these tasks could increase both their efficiency and their safety, but requires accurate detection of individual logs as well as live trees and their context. Although initial approaches have been proposed for this challenging application domain, specialized data and algorithms are still too scarce to develop robust solutions. To close this gap, we introduce the TimberVision dataset, consisting of more than 2k annotated RGB images containing a total of 51k trunk components including cut and lateral surfaces, thereby surpassing any existing dataset in this domain in terms of both quantity and detail by a large margin. Based on this data, we conduct a series of ablation experiments for oriented object detection and instance segmentation and evaluate the influence of multiple scene parameters on model performance. We introduce a generic framework to fuse the components detected by our models for both tasks into unified trunk representations. Furthermore, we automatically derive geometric properties and apply multi‑object tracking to further enhance robustness. Our detection and tracking approach provides highly descriptive and accurate trunk representations solely from RGB image data, even under challenging environmental conditions. Our solution is suitable for a wide range of application scenarios and can be readily combined with other sensor modalities.

Abstract:
Camouflaged object detection (COD) primarily relies on semantic or instance segmentation methods. While these methods have made significant advancements in identifying the contours of camouflaged objects, they may be inefficient or not cost‑effective for tasks that only require the specific location of the object. Object detection algorithms offer an optimized solution for Realistic Camouflaged Object Detection (RCOD) in such cases. However, detecting camouflaged objects remains a formidable challenge due to the high degree of similarity between the features of the objects and their backgrounds. Unlike segmentation methods that perform pixel‑wise comparisons to differentiate between foreground and background, object detectors omit this analysis, further aggravating the challenge. To solve this problem, we propose a camouflage‑aware feature refinement (CAFR) strategy. Since camouflaged objects are not rare categories, CAFR fully utilizes a clear perception of the current object within the prior knowledge of large models to assist detectors in deeply understanding the distinctions between background and foreground. Specifically, in CAFR, we introduce the Adaptive Gradient Propagation (AGP) module that fine‑tunes all feature extractor layers in large detection models to fully refine class‑specific features from camouflaged contexts. We then design the Sparse Feature Refinement (SFR) module that optimizes the transformer‑based feature extractor to focus primarily on capturing class‑specific features in camouflaged scenarios. To facilitate the assessment of RCOD tasks, we manually annotate the labels required for detection on three existing segmentation COD datasets, creating a new benchmark for RCOD tasks. Code and datasets are available at: https://github.com/zhimengXin/RCOD.

Abstract:
3D semantic scene completion is critical for multiple downstream tasks in autonomous systems. It estimates missing geometric and semantic information in the acquired scene data. Due to the challenging real‑world conditions, this task usually demands complex models that process multi‑modal data to achieve acceptable performance. We propose a unique neural model, leveraging advances from the state space and diffusion generative modeling to achieve remarkable 3D semantic scene completion performance with monocular image input. Our technique processes the data in the conditioned latent space of a variational autoencoder where diffusion modeling is carried out with an innovative state space technique. A key component of our neural network is the proposed Skimba (Skip Mamba) denoiser, which is adept at efficiently processing long‑sequence data. The Skimba diffusion model is integral to our 3D scene completion network, incorporating a triple Mamba structure, dimensional decomposition residuals and varying dilations along three directions. We also adopt a variant of this network for the subsequent semantic segmentation stage of our method. Extensive evaluation on the standard SemanticKITTI and SSCBench‑KITTI360 datasets shows that our approach not only outperforms other monocular techniques by a large margin, but also achieves competitive performance against stereo methods. The code is available at https://github.com/xrkong/skimba

Abstract:
On top of Segment Anything Model (SAM), SAM 2 further extends its capability from image to video inputs through a memory bank mechanism and obtains a remarkable performance compared with previous methods, making it a foundation model for video segmentation task. In this paper, we aim at making SAM 2 much more efficient so that it even runs on mobile devices while maintaining a comparable performance. Despite several works optimizing SAM for better efficiency, we find they are not sufficient for SAM 2 because they all focus on compressing the image encoder, while our benchmark shows that the newly introduced memory attention blocks are also the latency bottleneck. Given this observation, we propose EdgeTAM, which leverages a novel 2D Spatial Perceiver to reduce the computational cost. In particular, the proposed 2D Spatial Perceiver encodes the densely stored frame‑level memories with a lightweight Transformer that contains a fixed set of learnable queries. Given that video segmentation is a dense prediction task, we find preserving the spatial structure of the memories is essential so that the queries are split into global‑level and patch‑level groups. We also propose a distillation pipeline that further improves the performance without inference overhead. As a result, EdgeTAM achieves 87.7, 70.0, 72.3, and 71.7 J&F on DAVIS 2017, MOSE, SA‑V val, and SA‑V test, while running at 16 FPS on iPhone 15 Pro Max.

Abstract:
Scaling up the vocabulary of semantic segmentation models is extremely challenging because annotating large‑scale mask labels is labour‑intensive and time‑consuming. Recently, language‑guided segmentation models have been proposed to address this challenge. However, their performance drops significantly when applied to out‑of‑distribution categories. In this paper, we propose a new large vocabulary semantic segmentation framework, called LarvSeg. Different from previous works, LarvSeg leverages image classification data to scale the vocabulary of semantic segmentation models as large‑vocabulary classification datasets usually contain balanced categories and are much easier to obtain. However, for classification tasks, the category is image‑level, while for segmentation we need to predict the label at pixel level. To address this issue, we first propose a general baseline framework to incorporate image‑level supervision into the training process of a pixel‑level segmentation model, making the trained network perform semantic segmentation on newly introduced categories in the classification data. We then observe that a model trained on segmentation data can group pixel features of categories beyond the training vocabulary. Inspired by this finding, we design a category‑wise attentive classifier to apply supervision to the precise regions of corresponding categories to improve the model performance. Extensive experiments demonstrate that LarvSeg significantly improves the large vocabulary semantic segmentation performance, especially in the categories without mask labels. For the first time, we provide a 21K‑category semantic segmentation model with the help of ImageNet21K. The code is available at https://github.com/HaojunYu1998/large_voc_seg.

Abstract:
Despite the advancements of Video Large Language Models (VideoLLMs) in various tasks, they struggle with fine‑grained temporal understanding, such as Dense Video Captioning (DVC). DVC is a complicated task of describing all events within a video while also temporally localizing them, which integrates multiple fine‑grained tasks, including video segmentation, video captioning, and temporal video grounding. Previous VideoLLMs attempt to solve DVC in a single step, failing to utilize their reasoning capability. Moreover, previous training objectives for VideoLLMs do not fully reflect the evaluation metrics, therefore not providing supervision directly aligned to target tasks. To address such a problem, we propose a novel framework named VidChain comprised of Chain‑of‑Tasks (CoTasks) and Metric‑based Direct Preference Optimization (M‑DPO). CoTasks decompose a complex task into a sequence of sub‑tasks, allowing VideoLLMs to leverage their reasoning capabilities more effectively. M‑DPO aligns a VideoLLM with evaluation metrics, providing fine‑grained supervision to each task that is well‑aligned with metrics. Applied to two different VideoLLMs, VidChain consistently improves their fine‑grained video understanding, thereby outperforming previous VideoLLMs on two different DVC benchmarks and also on the temporal video grounding task. Code is available at https://github.com/mlvlab/VidChain.

Abstract:
4D panoptic LiDAR segmentation is essential for scene understanding in autonomous driving and robotics, combining semantic and instance segmentation with temporal consistency. Current methods, like 4D‑PLS and 4D‑STOP, use a tracking‑by‑detection methodology, employing deep learning networks to perform semantic and instance segmentation on each frame. To maintain temporal consistency, large‑size instances detected in the current frame are compared and associated with instances within a temporal window that includes the current and preceding frames. However, their reliance on short‑term instance detection, lack of motion estimation, and exclusion of small‑sized instances lead to frequent identity switches and reduced tracking performance. We address these issues with the NextStop tracker, which integrates Kalman filter‑based motion estimation, data association, and lifespan management, along with a tracklet state concept to improve prioritization. Evaluated using the LiDAR Segmentation and Tracking Quality (LSTQ) metric on the SemanticKITTI validation set, NextStop demonstrated enhanced tracking performance, particularly for small‑sized objects like people and bicyclists, with fewer ID switches, earlier tracking initiation, and improved reliability in complex environments. The source code is available at https://github.com/AIROTAU/NextStop
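
As a rough illustration of the Kalman filter‑based motion estimation mentioned above, the following is a minimal sketch of a one-dimensional constant-velocity filter. It is not taken from the NextStop code: the state layout `[position, velocity]`, the function names `kf_predict` and `kf_update`, and the noise values `q` and `r` are all assumptions chosen for the example.

```python
def kf_predict(x, P, dt=1.0, q=0.01):
    """Predict step for a constant-velocity model with state x = [position, velocity]."""
    p, v = x
    x_pred = [p + dt * v, v]
    # P_pred = F P F^T + Q, with F = [[1, dt], [0, 1]] and Q = q * I (assumed noise model).
    p00, p01, p10, p11 = P[0][0], P[0][1], P[1][0], P[1][1]
    P_pred = [
        [p00 + dt * (p10 + p01) + dt * dt * p11 + q, p01 + dt * p11],
        [p10 + dt * p11, p11 + q],
    ]
    return x_pred, P_pred


def kf_update(x, P, z, r=0.1):
    """Update step when only the position is observed (H = [1, 0])."""
    y = z - x[0]                       # innovation
    s = P[0][0] + r                    # innovation variance
    k0, k1 = P[0][0] / s, P[1][0] / s  # Kalman gain
    x_new = [x[0] + k0 * y, x[1] + k1 * y]
    # P_new = (I - K H) P
    P_new = [
        [(1 - k0) * P[0][0], (1 - k0) * P[0][1]],
        [P[1][0] - k1 * P[0][0], P[1][1] - k1 * P[0][1]],
    ]
    return x_new, P_new


def track(measurements, x0=(0.0, 0.0)):
    """Run predict/update over a sequence of position measurements."""
    x, P = list(x0), [[1.0, 0.0], [0.0, 1.0]]
    for z in measurements:
        x, P = kf_predict(x, P)
        x, P = kf_update(x, P, z)
    return x
```

Feeding the filter positions of an object that moves two units per frame, for example `track([2.0 * k for k in range(1, 11)])`, should yield a velocity estimate close to 2. In a tracker, the predicted position is what would be compared against new detections during data association.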

Abstract:
The pre‑training and fine‑tuning paradigm has revolutionized satellite remote sensing applications. However, this approach remains largely underexplored for airborne laser scanning (ALS), an important technology for applications such as forest management and urban planning. In this study, we address this gap by constructing a large‑scale ALS point cloud dataset and evaluating its impact on downstream applications. Our dataset comprises ALS point clouds collected across the contiguous United States, provided by the United States Geological Survey's 3D Elevation Program. To ensure efficient data collection while capturing diverse land cover and terrain types, we introduce a geospatial sampling method that selects point cloud tiles based on land cover maps and digital elevation models. As a baseline self‑supervised learning model, we adopt BEV‑MAE, a state‑of‑the‑art masked autoencoder for 3D outdoor point clouds, and pre‑train it on the constructed dataset. The pre‑trained models are subsequently fine‑tuned for downstream tasks, including tree species classification, terrain scene recognition, and point cloud semantic segmentation. Our results show that the pre‑trained models significantly outperform their scratch counterparts across all downstream tasks, demonstrating the transferability of the representations learned from the proposed dataset. Furthermore, we observe that scaling the dataset using our geospatial sampling method consistently enhances performance, whereas pre‑training on datasets constructed with random sampling fails to achieve similar improvements. These findings highlight the utility of the constructed dataset and the effectiveness of our sampling strategy in the pre‑training and fine‑tuning paradigm. The source code and pre‑trained models will be made publicly available at https://github.com/martianxiu/ALS_pretraining.

Abstract:
Referring video object segmentation aims to segment objects within a video corresponding to a given text description. Existing transformer‑based temporal modeling approaches face challenges related to query inconsistency and the limited consideration of context. Query inconsistency produces unstable masks of different objects in the middle of the video. The limited consideration of context leads to the segmentation of incorrect objects by failing to adequately account for the relationship between the given text and instances. To address these issues, we propose the Multi‑context Temporal Consistency Module (MTCM), which consists of an Aligner and a Multi‑Context Enhancer (MCE). The Aligner removes noise from queries and aligns them to achieve query consistency. The MCE predicts text‑relevant queries by considering multiple contexts. We applied MTCM to four different models, increasing performance across all of them, and in particular achieving 47.6 J&F on MeViS. Code is available at https://github.com/Choi58/MTCM.

Abstract:
This work presents Sa2VA, the first comprehensive, unified model for dense grounded understanding of both images and videos. Unlike existing multi‑modal large language models, which are often limited to specific modalities and tasks, Sa2VA supports a wide range of image and video tasks, including referring segmentation and conversation, with minimal one‑shot instruction tuning. Sa2VA combines SAM‑2, a foundation video segmentation model, with MLLM, the advanced vision‑language model, and unifies text, image, and video into a shared LLM token space. Using the LLM, Sa2VA generates instruction tokens that guide SAM‑2 in producing precise masks, enabling a grounded, multi‑modal understanding of both static and dynamic visual content. Additionally, we introduce Ref‑SAV, an auto‑labeled dataset containing over 72k object expressions in complex video scenes, designed to boost model performance. We also manually validate 2k video objects in the Ref‑SAV dataset to benchmark referring video object segmentation in complex environments. Experiments show that Sa2VA achieves strong performance across multiple tasks, particularly in referring video object segmentation, highlighting its potential for complex real‑world applications. In addition, Sa2VA can be easily extended to various VLMs, including Qwen‑VL and Intern‑VL, so it can be updated to keep pace with the rapid progress of open‑source VLMs. Code and models have been provided to the community.

Abstract:
Semantic segmentation of LiDAR points has significant value for autonomous driving and mobile robot systems. Most approaches explore spatio‑temporal information of multi‑scan to identify the semantic classes and motion states for each point. However, these methods often overlook the segmentation consistency in space and time, which may result in point clouds within the same object being predicted as different categories. To handle this issue, our core idea is to generate cluster labels across multiple frames that can reflect the complete spatial structure and temporal information of objects. These labels serve as explicit guidance for our dual‑branch network, 4D‑CS, which integrates point‑based and cluster‑based branches to enable more consistent segmentation. Specifically, in the point‑based branch, we leverage historical knowledge to enrich the current feature through temporal fusion on multiple views. In the cluster‑based branch, we propose a new strategy to produce cluster labels of foreground objects and apply them to gather point‑wise information to derive cluster features. We then merge neighboring clusters across multiple scans to restore missing features due to occlusion. Finally, in the point‑cluster fusion stage, we adaptively fuse the information from the two branches to optimize segmentation results. Extensive experiments confirm the effectiveness of the proposed method, and we achieve state‑of‑the‑art results on the multi‑scan semantic and moving object segmentation on SemanticKITTI and nuScenes datasets. The code will be available at https://github.com/NEU‑REAL/4D‑CS.git.

Abstract:
Vision Transformers (ViTs) have shown promise in medical image semantic segmentation (MISS) by capturing long‑range correlations. However, ViTs often struggle to model local spatial information effectively, which is essential for accurately segmenting fine anatomical details, particularly when applied to small datasets without extensive pre‑training. We introduce Gabor and Laplacian of Gaussian Convolutional Swin Network (GLoG‑CSUnet), a novel architecture enhancing Transformer‑based models by incorporating learnable radiomic features. This approach integrates dynamically adaptive Gabor and Laplacian of Gaussian (LoG) filters to capture texture, edge, and boundary information, enhancing the feature representation processed by the Transformer model. Our method uniquely combines the long‑range dependency modeling of Transformers with the texture analysis capabilities of Gabor and LoG features. Evaluated on the Synapse multi‑organ and ACDC cardiac segmentation datasets, GLoG‑CSUnet demonstrates significant improvements over state‑of‑the‑art models, achieving a 1.14% increase in Dice score for Synapse and 0.99% for ACDC, with minimal computational overhead (only 15 and 30 additional parameters, respectively). GLoG‑CSUnet's flexible design allows integration with various base models, offering a promising approach for incorporating radiomics‑inspired feature extraction in Transformer architectures for medical image analysis. The code implementation is available on GitHub at: https://github.com/HAAIL/GLoG‑CSUnet.
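
For readers unfamiliar with the two filter families named above, the sketch below samples the standard textbook Gabor and Laplacian of Gaussian (LoG) kernels on a small grid. It is only an illustration under stated assumptions: the function names and all parameter values are hypothetical, the grid size is assumed odd, and the parameters are fixed here, whereas the abstract describes them as learnable and dynamically adaptive.

```python
import math


def log_kernel(size, sigma):
    """Sample the Laplacian of Gaussian on a size x size grid (size assumed odd)."""
    half = size // 2
    s2 = sigma * sigma
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            r2 = x * x + y * y
            # LoG(x, y) = -(1 / (pi * sigma^4)) * (1 - r^2 / (2 sigma^2)) * exp(-r^2 / (2 sigma^2))
            value = -(1.0 / (math.pi * s2 * s2)) * (1.0 - r2 / (2.0 * s2)) * math.exp(-r2 / (2.0 * s2))
            row.append(value)
        kernel.append(row)
    return kernel


def gabor_kernel(size, sigma, theta, lam, gamma=1.0, psi=0.0):
    """Sample the real part of a Gabor filter on a size x size grid (size assumed odd)."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates by the filter orientation theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr * xr + gamma * gamma * yr * yr) / (2.0 * sigma * sigma))
            carrier = math.cos(2.0 * math.pi * xr / lam + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel
```

In a learnable variant, quantities such as `sigma`, `theta`, and `lam` would be trainable parameters from which the kernel is regenerated at each forward pass, which is one way a filter bank can add only a handful of parameters.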

Abstract:
Split computing (≠ split learning) is a promising approach to deep learning models for resource‑constrained edge computing systems, where weak sensor (mobile) devices are wirelessly connected to stronger edge servers through channels with limited communication capacity. State‑of‑the‑art work on split computing presents methods for single tasks such as image classification, object detection, or semantic segmentation. The application of existing methods to multi‑task problems degrades model accuracy and/or significantly increases runtime latency. In this study, we propose Ladon, the first multi‑task‑head supervised compression model for multi‑task split computing. Experimental results show that the multi‑task supervised compression model either outperformed or rivaled strong lightweight baseline models in terms of predictive performance for ILSVRC 2012, COCO 2017, and PASCAL VOC 2012 datasets while learning compressed representations at its early layers. Furthermore, our models reduced end‑to‑end latency (by up to 95.4%) and energy consumption of mobile devices (by up to 88.2%) in multi‑task split computing scenarios.

Abstract:
Open‑vocabulary panoptic reconstruction offers comprehensive scene understanding, enabling advances in embodied robotics and photorealistic simulation. In this paper, we propose PanopticRecon++, an end‑to‑end method that formulates panoptic reconstruction through a novel cross‑attention perspective. This perspective models the relationship between 3D instances (as queries) and the scene's 3D embedding field (as keys) through their attention map. Unlike existing methods that separate the optimization of queries and keys or overlook spatial proximity, PanopticRecon++ introduces learnable 3D Gaussians as instance queries. This formulation injects 3D spatial priors to preserve proximity while maintaining end‑to‑end optimizability. Moreover, this query formulation facilitates the alignment of 2D open‑vocabulary instance IDs across frames by leveraging optimal linear assignment with instance masks rendered from the queries. Additionally, we ensure semantic‑instance segmentation consistency by fusing query‑based instance segmentation probabilities with semantic probabilities in a novel panoptic head supervised by a panoptic loss. During training, the number of instance query tokens dynamically adapts to match the number of objects. PanopticRecon++ shows competitive 3D and 2D segmentation and reconstruction performance on both simulation and real‑world datasets, and demonstrates a use case as a robot simulator. Our project website is at: https://yuxuan1206.github.io/panopticrecon_pp/
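
The ID alignment step described above can be pictured as a small assignment problem: each rendered instance mask is matched to the per-frame 2D mask it overlaps most, subject to a one-to-one constraint. The sketch below is a hypothetical illustration and not the authors' implementation. It represents masks as sets of pixel indices, scores pairs by IoU (an assumed cost), and, to stay dependency-free, enumerates permutations instead of running the Hungarian algorithm that a practical system would use.

```python
from itertools import permutations


def iou(a, b):
    """Intersection over union of two masks given as sets of pixel indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def match_instances(rendered, detected):
    """Return, for each rendered mask, the index of the detected mask assigned to it.

    Brute-force search over one-to-one assignments that maximizes total IoU.
    Only practical for a handful of instances; assumes len(detected) >= len(rendered).
    """
    if len(detected) < len(rendered):
        raise ValueError("need at least as many detected masks as rendered masks")
    best, best_score = (), -1.0
    for perm in permutations(range(len(detected)), len(rendered)):
        score = sum(iou(rendered[i], detected[j]) for i, j in enumerate(perm))
        if score > best_score:
            best, best_score = perm, score
    return list(best)
```

With `rendered = [{1, 2, 3}, {10, 11}]` and `detected = [{10, 11, 12}, {1, 2}]`, the first rendered mask is matched to detected mask 1 and the second to detected mask 0, so the per-frame IDs can be relabeled consistently.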

Abstract:
Understanding geometric, semantic, and instance information in 3D scenes from sequential video data is essential for applications in robotics and augmented reality. However, existing Simultaneous Localization and Mapping (SLAM) methods generally focus on either geometric or semantic reconstruction. In this paper, we introduce PanoSLAM, the first SLAM system to integrate geometric reconstruction, 3D semantic segmentation, and 3D instance segmentation within a unified framework. Our approach builds upon 3D Gaussian Splatting, modified with several critical components to enable efficient rendering of depth, color, semantic, and instance information from arbitrary viewpoints. To achieve panoptic 3D scene reconstruction from sequential RGB‑D videos, we propose an online Spatial‑Temporal Lifting (STL) module that transfers 2D panoptic predictions from vision models into 3D Gaussian representations. This STL module addresses the challenges of label noise and inconsistencies in 2D predictions by refining the pseudo labels across multi‑view inputs, creating a coherent 3D representation that enhances segmentation accuracy. Our experiments show that PanoSLAM outperforms recent semantic SLAM methods in both mapping and tracking accuracy. For the first time, it achieves panoptic 3D reconstruction of open‑world environments directly from the RGB‑D video. (https://github.com/runnanchen/PanoSLAM)

Abstract:
Open‑vocabulary scene understanding using 3D Gaussian (3DGS) representations has garnered considerable attention. However, existing methods mostly lift knowledge from large 2D vision models into 3DGS on a scene‑by‑scene basis, which restricts open‑vocabulary querying to their training scenes and leaves them unable to generalize to novel scenes. In this work, we propose OVGaussian, a generalizable Open‑Vocabulary 3D semantic segmentation framework based on the 3D Gaussian representation. We first construct a large‑scale 3D scene dataset based on 3DGS, dubbed SegGaussian, which provides detailed semantic and instance annotations for both Gaussian points and multi‑view images. To promote semantic generalization across scenes, we introduce Generalizable Semantic Rasterization (GSR), which leverages a 3D neural network to learn and predict the semantic property for each 3D Gaussian point, where the semantic property can be rendered as multi‑view consistent 2D semantic maps. Next, we propose a Cross‑modal Consistency Learning (CCL) framework that utilizes open‑vocabulary annotations of 2D images and 3D Gaussians within SegGaussian to train a 3D neural network capable of open‑vocabulary semantic segmentation across Gaussian‑based 3D scenes. Experimental results demonstrate that OVGaussian significantly outperforms baseline methods, exhibiting robust cross‑scene, cross‑domain, and novel‑view generalization capabilities. Code and the SegGaussian dataset will be released. (https://github.com/runnanchen/OVGaussian).

Abstract:
Tissue semantic segmentation is one of the key tasks in computational pathology. To avoid the expensive and laborious acquisition of pixel‑level annotations, a wide range of studies attempt to adopt the class activation map (CAM), a weakly‑supervised learning scheme, to achieve pixel‑level tissue segmentation. However, CAM‑based methods are prone to suffer from under‑activation and over‑activation issues, leading to poor segmentation performance. To address this problem, we propose a novel weakly‑supervised semantic segmentation framework for histopathological images based on image‑mixing synthesis and consistency regularization, dubbed HisynSeg. Specifically, synthesized histopathological images with pixel‑level masks are generated for fully‑supervised model training, where two synthesis strategies are proposed based on Mosaic transformation and Bézier mask generation. Besides, an image filtering module is developed to guarantee the authenticity of the synthesized images. In order to further avoid the model overfitting to the occasional synthesis artifacts, we additionally propose a novel self‑supervised consistency regularization, which enables the real images without segmentation masks to supervise the training of the segmentation model. By integrating the proposed techniques, the HisynSeg framework successfully transforms the weakly‑supervised semantic segmentation problem into a fully‑supervised one, greatly improving the segmentation accuracy. Experimental results on three datasets prove that the proposed method achieves a state‑of‑the‑art performance. Code is available at https://github.com/Vison307/HisynSeg.

Abstract:
Recently, deep learning based methods have revolutionized remote sensing image segmentation. However, these methods usually rely on a pre‑defined semantic class set, thus needing additional image annotation and model training when adapting to new classes. More importantly, they are unable to segment arbitrary semantic classes. In this work, we introduce Open‑Vocabulary Remote Sensing Image Semantic Segmentation (OVRSISS), which aims to segment arbitrary semantic classes in remote sensing images. To address the lack of OVRSISS datasets, we develop LandDiscover50K, a comprehensive dataset of 51,846 images covering 40 diverse semantic classes. In addition, we propose a novel framework named GSNet that integrates domain priors from special remote sensing models and versatile capabilities of general vision‑language models. Technically, GSNet consists of a Dual‑Stream Image Encoder (DSIE), a Query‑Guided Feature Fusion (QGFF), and a Residual Information Preservation Decoder (RIPD). DSIE first captures comprehensive features from both special models and general models in dual streams. Then, with the guidance of variable vocabularies, QGFF integrates specialist and generalist features, enabling them to complement each other. Finally, RIPD is proposed to aggregate multi‑source features for more accurate mask predictions. Experiments show that our method outperforms other methods by a large margin, and our proposed LandDiscover50K improves the performance of OVRSISS methods. The proposed dataset and method will be made publicly available at https://github.com/yecy749/GSNet.

Abstract:
As the successor to the Segment Anything Model (SAM), the Segment Anything Model 2 (SAM2) not only improves performance in image segmentation but also extends its capabilities to video segmentation. However, its effectiveness in segmenting rare objects that seldom appear in videos remains underexplored. In this study, we evaluate SAM2 on two distinct video segmentation tasks: Video Shadow Detection (VSD) and Video Mirror Detection (VMD). Specifically, we use ground truth point or mask prompts to initialize the first frame and then predict corresponding masks for subsequent frames. Experimental results show that SAM2's performance on these tasks is suboptimal, both quantitatively and qualitatively, especially when point prompts are used. Code is available at https://github.com/LeipingJie/SAM2Video

Abstract:
Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) are two dominant models for image analysis. While CNNs excel at extracting multi‑scale features and ViTs effectively capture global dependencies, both suffer from high computational costs, particularly when processing high‑resolution images. Recently, state‑space models (SSMs) and recurrent neural networks (RNNs) have attracted attention due to their efficiency. However, their performance in image classification tasks remains limited. To address these challenges, this paper introduces VisionGRU, a novel RNN‑based architecture designed for efficient image classification. VisionGRU leverages a simplified Gated Recurrent Unit (minGRU) to process large‑scale image features with linear complexity. It divides images into smaller patches and progressively reduces the sequence length while increasing the channel depth, thus facilitating multi‑scale feature extraction. A hierarchical 2DGRU module with bidirectional scanning captures both local and global contexts, improving long‑range dependency modeling, particularly for tasks like semantic segmentation. Experimental results on the ImageNet and ADE20K datasets demonstrate that VisionGRU outperforms ViTs, significantly reducing memory usage and computational costs, especially for high‑resolution images. These findings underscore the potential of RNN‑based approaches for developing efficient and scalable computer vision solutions. Code will be available at https://github.com/YangLiu9208/VisionGRU.
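
To make the minGRU recurrence concrete, here is a minimal scalar sketch. It assumes the commonly published minGRU form, in which both the gate and the candidate state depend only on the current input; whether VisionGRU uses exactly this form is an assumption, and the function name and scalar weights `w_z`, `b_z`, `w_h`, `b_h` are hypothetical stand-ins for learned linear layers.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def min_gru(xs, w_z, b_z, w_h, b_h, h0=0.0):
    """Run a scalar minGRU over the sequence xs and return all hidden states.

    z_t     = sigmoid(w_z * x_t + b_z)          # update gate, depends on the input only
    h~_t    = w_h * x_t + b_h                   # candidate state, depends on the input only
    h_t     = (1 - z_t) * h_{t-1} + z_t * h~_t  # convex mix of old state and candidate
    """
    h = h0
    states = []
    for x in xs:
        z = sigmoid(w_z * x + b_z)
        h_tilde = w_h * x + b_h
        h = (1.0 - z) * h + z * h_tilde
        states.append(h)
    return states
```

Each step costs a constant amount of work, so the loop is linear in the sequence length. Because neither `z` nor `h_tilde` depends on the previous hidden state, the same recurrence can also be evaluated with a parallel scan during training.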

Abstract:
Few‑shot learning aims to recognize novel concepts by leveraging prior knowledge learned from a few samples. However, for visually intensive tasks such as few‑shot semantic segmentation, pixel‑level annotations are time‑consuming and costly. Therefore, in this paper, we utilize the more challenging image‑level annotations and propose an adaptive frequency‑aware network (AFANet) for weakly‑supervised few‑shot semantic segmentation (WFSS). Specifically, we first propose a cross‑granularity frequency‑aware module (CFM) that decouples RGB images into high‑frequency and low‑frequency distributions and further optimizes semantic structural information by realigning them. Unlike most existing WFSS methods using the textual information from the multi‑modal language‑vision model, e.g., CLIP, in an offline learning manner, we further propose a CLIP‑guided spatial‑adapter module (CSM), which performs spatial domain adaptive transformation on textual information through online learning, thus providing enriched cross‑modal semantic information for CFM. Extensive experiments on the Pascal‑5^i and COCO‑20^i datasets demonstrate that AFANet has achieved state‑of‑the‑art performance. The code is available at https://github.com/jarch‑ma/AFANet.

Abstract:
Semi‑supervised semantic segmentation has attracted considerable attention for its ability to mitigate the reliance on extensive labeled data. However, existing consistency regularization methods only utilize high‑certainty pixels, whose prediction confidence surpasses a fixed threshold, for training, failing to fully leverage the potential supervisory information within the network. Therefore, this paper proposes the Uncertainty‑participation Context Consistency Learning (UCCL) method to explore richer supervisory signals. Specifically, we first design the semantic backpropagation update (SBU) strategy to fully exploit the knowledge from uncertain pixel regions, enabling the model to learn consistent pixel‑level semantic information from those areas. Furthermore, we propose the class‑aware knowledge regulation (CKR) module to facilitate the regulation of class‑level semantic features across different augmented views, promoting consistent learning of class‑level semantic information within the encoder. Experimental results on two public benchmarks demonstrate that our proposed method achieves state‑of‑the‑art performance. Our code is available at https://github.com/YUKEKEJAN/UCCL.
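
The fixed-threshold filtering that the abstract attributes to existing consistency regularization methods can be sketched as follows. This illustrates the baseline practice being criticized, not UCCL itself; the function name, the threshold of 0.95, and the ignore index of 255 are assumptions chosen for the example.

```python
IGNORE_INDEX = 255  # assumed label for pixels excluded from the unsupervised loss


def threshold_pseudo_labels(probs, threshold=0.95):
    """Turn per-pixel class probabilities into pseudo-labels.

    probs: list of per-pixel probability lists, one entry per class.
    Pixels whose top confidence is below the threshold get IGNORE_INDEX,
    so they contribute no supervisory signal.
    """
    labels = []
    for pixel_probs in probs:
        confidence = max(pixel_probs)
        if confidence >= threshold:
            labels.append(pixel_probs.index(confidence))
        else:
            labels.append(IGNORE_INDEX)
    return labels
```

For `[[0.97, 0.03], [0.6, 0.4], [0.02, 0.98]]`, the middle pixel is discarded and the result is `[0, 255, 1]`. Uncertain pixels like that one are the ones the abstract argues still carry usable information.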

Abstract:
When given two similar images, humans identify their differences by comparing the appearance (e.g., color, texture) with the help of semantics (e.g., objects, relations). However, mainstream binary change detection models adopt a supervised training paradigm, where the annotated binary change map is the main constraint. Thus, such methods primarily emphasize difference‑aware features between bi‑temporal images, and the semantic understanding of changed landscapes is undermined, resulting in limited accuracy in the face of noise and illumination variations. To this end, this paper explores incorporating semantic priors from visual foundation models to improve the ability to detect changes. Firstly, we propose a Semantic‑Aware Change Detection network (SA‑CDNet), which transfers the knowledge of visual foundation models (i.e., FastSAM) to change detection. Inspired by the human visual paradigm, a novel dual‑stream feature decoder is derived to distinguish changes by combining semantic‑aware features and difference‑aware features. Secondly, we explore a single‑temporal pre‑training strategy for better adaptation of visual foundation models. With pseudo‑change data constructed from single‑temporal segmentation datasets, we employ an extra branch of proxy semantic segmentation task for pre‑training. We explore various settings like dataset combinations and landscape types, thus providing valuable insights. Experimental results on five challenging benchmarks demonstrate the superiority of our method over the existing state‑of‑the‑art methods. The code is available at https://github.com/DREAMXFAR/SA‑CDNet.

Abstract:
Colorectal cancer (CRC) remains a leading cause of cancer‑related deaths worldwide, with polyp removal being an effective early screening method. However, navigating the colon for thorough polyp detection poses significant challenges. To advance camera navigation in colonoscopy, we propose the Semantic Segmentation for Tools and Fold Edges in Colonoscopy (SegCol) Challenge. This challenge introduces a dataset from the EndoMapper repository, featuring manually annotated, pixel‑level semantic labels for colon folds and endoscopic tools across selected frames from 96 colonoscopy videos. By providing fold edges as anatomical landmarks and depth discontinuity information from both fold and tool labels, the dataset aims to improve depth perception and localization methods. Hosted as part of the Endovis Challenge at MICCAI 2024, SegCol aims to drive innovation in colonoscopy navigation systems. Details are available at https://www.synapse.org/Synapse:syn54124209/wiki/626563, and code resources at https://github.com/surgical‑vision/segcol_challenge .

Abstract:
Generalized few‑shot semantic segmentation (GFSS) aims to segment objects of both base and novel classes, using sufficient samples of base classes and few samples of novel classes. Representative GFSS approaches typically employ a two‑phase training scheme, involving base class pre‑training followed by novel class fine‑tuning, to learn the classifiers for base and novel classes respectively. Nevertheless, a distribution gap exists between base and novel classes in this process. To narrow this gap, we exploit effective knowledge transfer from base to novel classes. First, a novel prototype modulation module is designed to modulate novel class prototypes by exploiting the correlations between base and novel classes. Second, a novel classifier calibration module is proposed to calibrate the weight distribution of the novel classifier according to that of the base classifier. Furthermore, because existing GFSS approaches suffer from a lack of contextual information for novel classes due to their limited samples, we introduce a context consistency learning scheme to transfer the contextual knowledge from base to novel classes. Extensive experiments on PASCAL‑5^i and COCO‑20^i demonstrate that our approach significantly enhances the state of the art in the GFSS setting. The code is available at: https://github.com/HHHHedy/GFSS‑EKT.

Abstract:
Semantic segmentation under domain shift remains a fundamental challenge in computer vision, particularly when labelled training data is scarce. This challenge is particularly exemplified in histopathology image analysis, where the same tissue structures must be segmented across images captured under different imaging conditions (stains), each representing a distinct visual domain. Traditional deep learning methods like UNet require extensive labels, which is both costly and time‑consuming, particularly when dealing with multiple domains (or stains). To mitigate this, various unsupervised domain adaptation based methods such as UDAGAN have been proposed, which reduce the need for labels by requiring only one (source) stain to be labelled. Nonetheless, obtaining source stain labels can still be challenging. This article shows that through self‑supervised pre‑training (including SimCLR, BYOL, and a novel approach, HR‑CS‑CO), the performance of these segmentation methods (UNet and UDAGAN) can be retained even with 95% fewer labels. Notably, with self‑supervised pre‑training and using only 5% labels, the performance drops are minimal: 5.9% for UNet and 6.2% for UDAGAN, averaged over all stains, compared to their respective fully supervised counterparts (without pre‑training, using 100% labels). Furthermore, these findings are shown to generalise beyond their training distribution to public benchmark datasets. Implementations and pre‑trained models are publicly available online at https://github.com/zeeshannisar/resource‑effecient‑multi‑stain‑kidney‑glomeruli‑segmentation.git.

Abstract:
This work proposes a novel framework, Uncertainty‑Guided Cross Attention Ensemble Mean Teacher (UG‑CEMT), for achieving state‑of‑the‑art performance in semi‑supervised medical image segmentation. UG‑CEMT leverages the strengths of co‑training and knowledge distillation by combining a Cross‑attention Ensemble Mean Teacher framework (CEMT) inspired by Vision Transformers (ViT) with uncertainty‑guided consistency regularization and Sharpness‑Aware Minimization emphasizing uncertainty. UG‑CEMT improves semi‑supervised performance while maintaining a consistent network architecture and task setting by fostering high disparity between sub‑networks. Experiments demonstrate significant advantages over existing methods like Mean Teacher and Cross‑pseudo Supervision in terms of disparity, domain generalization, and medical image segmentation performance. UG‑CEMT achieves state‑of‑the‑art results on multi‑center prostate MRI and cardiac MRI datasets, where object segmentation is particularly challenging. Our results show that using only 10% labeled data, UG‑CEMT approaches the performance of fully supervised methods, demonstrating its effectiveness in exploiting unlabeled data for robust medical image segmentation. The code is publicly available at https://github.com/Meghnak13/UG‑CEMT.
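For readers unfamiliar with the Mean Teacher baseline named in this abstract, its core step is an exponential moving average (EMA) of student weights into the teacher. The sketch below shows only that standard update, assuming weights stored as a flat dictionary of floats; it does not reproduce the cross‑attention ensemble or uncertainty guidance of UG‑CEMT, and `ema_update` and `alpha` are illustrative names.

```python
def ema_update(teacher, student, alpha=0.99):
    """Standard Mean Teacher step: teacher <- alpha * teacher + (1 - alpha) * student.

    teacher, student: dicts mapping parameter names to float values.
    Returns a new dict; the inputs are left unmodified.
    """
    return {name: alpha * teacher[name] + (1.0 - alpha) * student[name]
            for name in teacher}


# Hypothetical single-parameter model.
teacher = {"w": 1.0}
student = {"w": 0.0}
teacher = ema_update(teacher, student, alpha=0.9)
```

A larger `alpha` makes the teacher change more slowly, which tends to give smoother pseudo‑label targets at the cost of lagging behind the student.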

Abstract:
Although multiview fusion has demonstrated potential in LiDAR segmentation, its dependence on computationally intensive point‑based interactions, arising from the lack of fixed correspondences between views such as range view and Bird's‑Eye View (BEV), hinders its practical deployment. This paper challenges the prevailing notion that multiview fusion is essential for achieving high performance. We demonstrate that significant gains can be realized by directly fusing Polar and Cartesian partitioning strategies within the BEV space. Our proposed BEV‑only segmentation model leverages the inherent fixed grid correspondences between these partitioning schemes, enabling a fusion process that is orders of magnitude faster (170× speedup) than conventional point‑based methods. Furthermore, our approach facilitates dense feature fusion, preserving richer contextual information compared to sparse point‑based alternatives. To enhance scene understanding while maintaining inference efficiency, we also introduce a hybrid Transformer‑CNN architecture. Extensive evaluation on the SemanticKITTI and nuScenes datasets provides compelling evidence that our method outperforms previous multiview fusion approaches in terms of both performance and inference speed, highlighting the potential of BEV‑based fusion for LiDAR segmentation. Code is available at https://github.com/skyshoumeng/PC‑BEV.
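One way to read the "fixed grid correspondences" mentioned here is that every Cartesian BEV cell maps to a polar BEV cell that depends only on the grid geometry, so the mapping can be precomputed once instead of being recomputed per point. The sketch below illustrates that idea under assumed grid parameters; the function names, grid sizes, and cell‑centre sampling are hypothetical and are not taken from the PC‑BEV code.

```python
import math


def cartesian_to_polar_cell(ix, iy, cell_size, half_extent, n_rho, n_theta, max_range):
    """Map a Cartesian BEV cell index to the polar cell containing its centre."""
    x = -half_extent + (ix + 0.5) * cell_size
    y = -half_extent + (iy + 0.5) * cell_size
    rho = math.hypot(x, y)
    if rho >= max_range:
        return None  # centre falls outside the polar grid
    theta = math.atan2(y, x)  # angle in [-pi, pi]
    ir = int(rho / max_range * n_rho)
    it = int((theta + math.pi) / (2.0 * math.pi) * n_theta) % n_theta
    return ir, it


def build_lookup(n_cells, cell_size, n_rho, n_theta, max_range):
    """Precompute the correspondence for a square n_cells x n_cells grid."""
    half_extent = n_cells * cell_size / 2.0
    return {(ix, iy): cartesian_to_polar_cell(ix, iy, cell_size, half_extent,
                                              n_rho, n_theta, max_range)
            for ix in range(n_cells) for iy in range(n_cells)}


# Toy 4 x 4 Cartesian grid (1 m cells) against a 4-ring, 4-sector polar grid.
table = build_lookup(n_cells=4, cell_size=1.0, n_rho=4, n_theta=4, max_range=4.0)
```

Because the table is independent of the input point cloud, feature fusion between the two grids reduces to an index lookup at inference time.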

Abstract:
Boosted by Multi‑modal Large Language Models (MLLMs), text‑guided universal segmentation models for the image and video domains have made rapid progress recently. However, these methods are often developed separately for specific domains, overlooking the similarities in task settings and solutions across these two areas. In this paper, we define the union of referring segmentation and reasoning segmentation at both the image and video levels as Instructed Visual Segmentation (IVS). Correspondingly, we propose InstructSeg, an end‑to‑end segmentation pipeline equipped with MLLMs for IVS. Specifically, we employ an object‑aware video perceiver to extract temporal and object information from reference frames, facilitating comprehensive video understanding. Additionally, we introduce vision‑guided multi‑granularity text fusion to better integrate global and detailed text information with fine‑grained visual guidance. By leveraging multi‑task and end‑to‑end training, InstructSeg demonstrates superior performance across diverse image and video segmentation tasks, surpassing both segmentation specialists and MLLM‑based methods with a single model. Our code is available at https://github.com/congvvc/InstructSeg.

Abstract:
Intelligent robots need to interact with diverse objects across various environments. The appearance and state of objects frequently undergo complex transformations depending on the object properties, e.g., phase transitions. However, in the vision community, segmenting dynamic objects with phase transitions is overlooked. In light of this, we introduce the concept of phase in segmentation, which categorizes real‑world objects based on their visual characteristics and potential morphological and appearance changes. Then, we present a new benchmark, Multi‑Phase, Multi‑Transition, and Multi‑Scenery Video Object Segmentation (M^3‑VOS), to verify the ability of models to understand object phases, which consists of 479 high‑resolution videos spanning over 10 distinct everyday scenarios. It provides dense instance mask annotations that capture both object phases and their transitions. We evaluate state‑of‑the‑art methods on M^3‑VOS, yielding several key insights. Notably, current appearance‑based approaches show significant room for improvement when handling objects with phase transitions. The inherent changes in disorder suggest that the predictive performance of the forward entropy‑increasing process can be improved through a reverse entropy‑reducing process. These findings lead us to propose ReVOS, a new plug‑and‑play model that improves its performance by reversal refinement. Our data and code will be publicly available at https://zixuan‑chen.github.io/M‑cube‑VOS.github.io/.

Abstract:
3D open‑vocabulary scene understanding, which accurately perceives complex semantic properties of objects in space, has gained significant attention in recent years. In this paper, we propose GAGS, a framework that distills 2D CLIP features into 3D Gaussian splatting, enabling open‑vocabulary queries for renderings on arbitrary viewpoints. The main challenge of distilling 2D features for 3D fields lies in the multiview inconsistency of extracted 2D features, which provides unstable supervision for the 3D feature field. GAGS addresses this challenge with two novel strategies. First, GAGS associates the prompt point density of SAM with the camera distances, which significantly improves the multiview consistency of segmentation results. Second, GAGS further decodes a granularity factor to guide the distillation process; this granularity factor can be learned in an unsupervised manner so that only the multiview‑consistent 2D features are selected in the distillation process. Experimental results on two datasets demonstrate significant performance and stability improvements of GAGS in visual grounding and semantic segmentation, with an inference speed 2× faster than baseline methods. The code and additional results are available at https://pz0826.github.io/GAGS‑Webpage/ .

Abstract:
Robustness and generalizability in medical image segmentation are often hindered by scarcity and limited diversity of training data, which stands in contrast to the variability encountered during inference. While conventional strategies, such as domain‑specific augmentation, specialized architectures, and tailored training procedures, can alleviate these issues, they depend on the availability and reliability of domain knowledge. When such knowledge is unavailable, misleading, or improperly applied, performance may deteriorate. In response, we introduce a novel, domain‑agnostic, add‑on, and data‑driven strategy inspired by image stacking in image denoising. Termed "semantic stacking," our method estimates a denoised semantic representation that complements the conventional segmentation loss during training. This method does not depend on domain‑specific assumptions, making it broadly applicable across diverse image modalities, model architectures, and augmentation techniques. Through extensive experiments, we validate the superiority of our approach in improving segmentation performance under diverse conditions. Code is available at https://github.com/ymp5078/Semantic‑Stacking.

Abstract:
Event‑based semantic segmentation has great potential in autonomous driving and robotics due to the advantages of event cameras, such as high dynamic range, low latency, and low power cost. Unfortunately, current artificial neural network (ANN)‑based segmentation methods suffer from high computational demands, the requirements for image frames, and massive energy consumption, limiting their efficiency and application on resource‑constrained edge/mobile platforms. To address these problems, we introduce SLTNet, a spike‑driven lightweight transformer‑based network designed for event‑based semantic segmentation. Specifically, SLTNet is built on efficient spike‑driven convolution blocks (SCBs) to extract rich semantic features while reducing the model's parameters. Then, to enhance the long‑range contextual feature interaction, we propose novel spike‑driven transformer blocks (STBs) with binary mask operations. Based on these basic blocks, SLTNet employs a high‑efficiency single‑branch architecture while maintaining the low energy consumption of the Spiking Neural Network (SNN). Finally, extensive experiments on DDD17 and DSEC‑Semantic datasets demonstrate that SLTNet outperforms state‑of‑the‑art (SOTA) SNN‑based methods by at most 9.06% and 9.39% mIoU, respectively, with 4.58x lower energy consumption and 114 FPS inference speed. Our code is open‑sourced and available at https://github.com/longxianlei/SLTNet‑v1.0.

Abstract:
Instance segmentation algorithms in remote sensing are typically based on conventional methods, limiting their application to seen scenarios and closed‑set predictions. In this work, we propose a novel task called zero‑shot remote sensing instance segmentation, aimed at identifying aerial objects that are absent from training data. Challenges arise when classifying aerial categories with high inter‑class similarity and intra‑class variance. Besides, the domain gap between vision‑language models' pretraining datasets and remote sensing datasets hinders the zero‑shot capabilities of the pretrained model when it is directly applied to remote sensing images. To address these challenges, we propose a Zero‑Shot Remote Sensing Instance Segmentation framework, dubbed ZoRI. Our approach features a discrimination‑enhanced classifier that uses refined textual embeddings to increase the awareness of class disparities. Instead of direct fine‑tuning, we propose a knowledge‑maintained adaptation strategy that decouples semantic‑related information to preserve the pretrained vision‑language alignment while adjusting features to capture remote sensing domain‑specific visual cues. Additionally, we introduce a prior‑injected prediction with a cache bank of aerial visual prototypes to supplement the semantic richness of text embeddings and seamlessly integrate aerial representations, adapting to the remote sensing domain. We establish new experimental protocols and benchmarks, and extensive experiments convincingly demonstrate that ZoRI achieves state‑of‑the‑art performance on the zero‑shot remote sensing instance segmentation task. Our code is available at https://github.com/HuangShiqi128/ZoRI.

Abstract:
Class incremental semantic segmentation (CISS) aims to segment new classes during continual steps while preventing the forgetting of old knowledge. Existing methods alleviate catastrophic forgetting by replaying distributions of previously learned classes using stored prototypes or features. However, they overlook a critical issue: in CISS, the representation of class knowledge is updated continuously through incremental learning, whereas prototype replay methods maintain fixed prototypes. This mismatch between updated representation and fixed prototypes limits the effectiveness of the prototype replay strategy. To address this issue, we propose the Adaptive prototype replay (Adapter) for CISS in this paper. Adapter comprises an adaptive deviation compensation (ADC) strategy and an uncertainty‑aware constraint (UAC) loss. Specifically, the ADC strategy dynamically updates the stored prototypes based on the estimated representation shift distance to match the updated representation of old classes. The UAC loss reduces prediction uncertainty, aggregating discriminative features to aid in generating compact prototypes. Additionally, we introduce a compensation‑based prototype similarity discriminative (CPD) loss to ensure adequate differentiation between similar prototypes, thereby enhancing the efficiency of the adaptive prototype replay strategy. Extensive experiments on Pascal VOC and ADE20K datasets demonstrate that Adapter achieves state‑of‑the‑art results and proves effective across various CISS tasks, particularly in challenging multi‑step scenarios. The code and model are available at https://github.com/zhu‑gl‑ux/Adapter.

Abstract:
Domain Generalized Semantic Segmentation (DGSS) seeks to utilize source domain data exclusively to enhance the generalization of semantic segmentation across unknown target domains. Prevailing studies predominantly concentrate on feature normalization and domain randomization, but these approaches exhibit significant limitations. Feature normalization‑based methods tend to confuse semantic features in the process of constraining the feature space distribution, resulting in classification misjudgment. Domain randomization‑based methods frequently incorporate domain‑irrelevant noise due to the uncontrollability of style transformations, resulting in segmentation ambiguity. To address these challenges, we introduce a novel framework, named SCSD for Semantic Consistency prediction and Style Diversity generalization. It comprises three pivotal components: Firstly, a Semantic Query Booster is designed to enhance the semantic awareness and discrimination capabilities of object queries in the mask decoder, enabling cross‑domain semantic consistency prediction. Secondly, we develop a Text‑Driven Style Transform module that utilizes domain difference text embeddings to controllably guide the style transformation of image features, thereby increasing inter‑domain style diversity. Lastly, to prevent the collapse of similar domain feature spaces, we introduce a Style Synergy Optimization mechanism that fortifies the separation of inter‑domain features and the aggregation of intra‑domain features by synergistically weighting style contrastive loss and style aggregation loss. Extensive experiments demonstrate that the proposed SCSD significantly outperforms existing state‑of‑the‑art methods. Notably, SCSD trained on GTAV achieved an average of 49.11 mIoU on the four unseen domain datasets, surpassing the previous state‑of‑the‑art method by +4.08 mIoU. Code is available at https://github.com/nhw649/SCSD.
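The mIoU figures quoted in this and neighbouring abstracts follow the usual mean intersection‑over‑union definition. A minimal sketch is given below, assuming flattened integer label lists and skipping classes that appear in neither prediction nor ground truth; the name `mean_iou` is illustrative, and individual benchmarks may instead accumulate counts over the whole dataset before averaging.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # ignore classes absent from both label maps
            ious.append(inter / union)
    if not ious:
        raise ValueError("no class is present in either label map")
    return sum(ious) / len(ious)


# Four hypothetical pixels, two classes.
pred = [0, 0, 1, 1]
target = [0, 1, 1, 1]
score = mean_iou(pred, target, num_classes=2)
```

Here class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.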

Abstract:
High‑quality semantic segmentation relies on three key capabilities: global context modeling, local detail encoding, and multi‑scale feature extraction. However, recent methods struggle to possess all these capabilities simultaneously. Hence, we aim to empower segmentation networks to simultaneously carry out efficient global context modeling, high‑quality local detail encoding, and rich multi‑scale feature representation for varying input resolutions. In this paper, we introduce SegMAN, a novel linear‑time model comprising a hybrid feature encoder dubbed SegMAN Encoder, and a decoder based on state space models. Specifically, the SegMAN Encoder synergistically integrates sliding local attention with dynamic state space models, enabling highly efficient global context modeling while preserving fine‑grained local details. Meanwhile, the MMSCopE module in our decoder enhances multi‑scale context feature extraction and adaptively scales with the input resolution. Our SegMAN‑B Encoder achieves 85.1% ImageNet‑1k accuracy (+1.5% over VMamba‑S with fewer parameters). When paired with our decoder, the full SegMAN‑B model achieves 52.6% mIoU on ADE20K (+1.6% over SegNeXt‑L with 15% fewer GFLOPs), 83.8% mIoU on Cityscapes (+2.1% over SegFormer‑B3 with half the GFLOPs), and 1.6% higher mIoU than VWFormer‑B3 on COCO‑Stuff with lower GFLOPs. Our code is available at https://github.com/yunxiangfu2001/SegMAN.

Abstract:
Vulnerability to adversarial attacks is a well‑known deficiency of deep neural networks. Larger networks are generally more robust, and ensembling is one method to increase adversarial robustness: each model's weaknesses are compensated by the strengths of others. While an ensemble uses a deterministic rule to combine model outputs, a mixture of experts (MoE) includes an additional learnable gating component that predicts weights for the outputs of the expert models, thus determining their contributions to the final prediction. MoEs have been shown to outperform ensembles on specific tasks, yet their susceptibility to adversarial attacks has not been studied yet. In this work, we evaluate the adversarial vulnerability of MoEs for semantic segmentation of urban and highway traffic scenes. We show that MoEs are, in most cases, more robust to per‑instance and universal white‑box adversarial attacks and can better withstand transfer attacks. Our code is available at https://github.com/KASTEL‑MobilityLab/mixtures‑of‑experts/.
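The distinction drawn here between an ensemble and a mixture of experts can be illustrated with a small sketch. It assumes each expert emits a list of class scores and that the gate produces one logit per expert, which is normalized with a softmax; in a real MoE the gate logits come from a learned network, whereas here they are passed in by hand. All function names are illustrative.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_average(expert_outputs):
    """Deterministic rule: plain mean of the expert score vectors."""
    n = len(expert_outputs)
    return [sum(scores) / n for scores in zip(*expert_outputs)]


def moe_combine(expert_outputs, gate_logits):
    """MoE rule: weight each expert by the softmax of its gate logit."""
    weights = softmax(gate_logits)
    return [sum(w * s for w, s in zip(weights, scores))
            for scores in zip(*expert_outputs)]


# Two hypothetical experts that disagree on a two-class pixel.
experts = [[1.0, 0.0], [0.0, 1.0]]
avg = ensemble_average(experts)
gated = moe_combine(experts, gate_logits=[100.0, -100.0])
```

With equal gate logits the MoE output coincides with the ensemble mean; with a strongly skewed gate, as above, the prediction follows a single expert.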

Abstract:
Archaeological pottery documentation and study represents a crucial but time‑consuming aspect of archaeology. While recent years have seen advances in digital documentation methods, vast amounts of legacy data remain locked in traditional publications. This paper introduces PyPotteryLens, an open‑source framework that leverages deep learning to automate the digitisation and processing of archaeological pottery drawings from published sources. The system combines state‑of‑the‑art computer vision models (YOLO for instance segmentation and EfficientNetV2 for classification) with an intuitive user interface, making advanced digital methods accessible to archaeologists regardless of technical expertise. The framework achieves over 97% precision and recall in pottery detection and classification tasks, while reducing processing time by 5x to 20x compared to manual methods. Testing across diverse archaeological contexts demonstrates robust generalisation capabilities. In addition, the system's modular architecture facilitates extension to other archaeological materials, while its standardised output format ensures long‑term preservation and reusability of digitised data and provides a solid basis for training machine learning algorithms. The software, documentation, and examples are available on GitHub (https://github.com/lrncrd/PyPottery/tree/PyPotteryLens).

Abstract:
Weakly Supervised Semantic Segmentation (WSSS) with image‑level labels typically uses Class Activation Maps (CAM) to achieve dense predictions. Recently, Vision Transformer (ViT) has provided an alternative to generate localization maps from class‑patch attention. However, due to insufficient constraints on modeling such attention, we observe that the Localization Attention Maps (LAM) often struggle with the artifact issue, i.e., patch regions with minimal semantic relevance are falsely activated by class tokens. In this work, we propose MoRe to address this issue and further explore the potential of LAM. Our findings suggest that imposing additional regularization on class‑patch attention is necessary. To this end, we first view the attention as a novel directed graph and propose the Graph Category Representation module to implicitly regularize the interaction among class‑patch entities. It ensures that class tokens dynamically condense the related patch information and suppress unrelated artifacts at a graph level. Second, motivated by the observation that CAM from classification weights maintains smooth localization of objects, we devise the Localization‑informed Regularization module to explicitly regularize the class‑patch attention. It directly mines the token relations from CAM and further supervises the consistency between class and patch tokens in a learnable manner. Extensive experiments are conducted on PASCAL VOC and MS COCO, validating that MoRe effectively addresses the artifact issue and achieves state‑of‑the‑art performance, surpassing recent single‑stage and even multi‑stage methods. Code is available at https://github.com/zwyang6/MoRe.

Abstract:
Semantic segmentation suffers from significant performance degradation when the trained network is applied to a different domain. To address this issue, unsupervised domain adaptation (UDA) has been extensively studied. Despite the effectiveness of self‑training techniques in UDA, they still overlook the explicit modeling of domain‑shared feature extraction. In this paper, we propose DiDA, an unsupervised domain bridging approach for semantic segmentation. DiDA consists of two key modules: (1) Degradation‑based Intermediate Domain Construction, which creates continuous intermediate domains through simple image degradation operations to encourage learning domain‑invariant features as domain differences gradually diminish; (2) Semantic Shift Compensation, which leverages a diffusion encoder to disentangle and compensate for semantic shift information with degraded timesteps, preserving discriminative representations in the intermediate domains. As a plug‑and‑play solution, DiDA supports various degradation operations and seamlessly integrates with existing UDA methods. Extensive experiments on multiple domain adaptive semantic segmentation benchmarks demonstrate that DiDA consistently achieves significant performance improvements across all settings. Code is available at https://github.com/Woof6/DiDA.

Abstract:
Existing methods enhance the training of detection transformers by incorporating an auxiliary one‑to‑many assignment. In this work, we treat the model as a multi‑task framework, simultaneously performing one‑to‑one and one‑to‑many predictions. We investigate the roles of each component in the transformer decoder across these two training targets, including self‑attention, cross‑attention, and feed‑forward network. Our empirical results demonstrate that any independent component in the decoder can effectively learn both targets simultaneously, even when other components are shared. This finding leads us to propose a multi‑route training mechanism, featuring a primary route for one‑to‑one prediction and two auxiliary training routes for one‑to‑many prediction. We propose a novel instructive self‑attention mechanism, integrated into the first auxiliary route, which dynamically and flexibly guides object queries for one‑to‑many prediction. For the second auxiliary route, we introduce a route‑aware Mixture‑of‑Experts (MoE) to facilitate knowledge sharing while mitigating potential conflicts between routes. Additionally, we apply an MoE to low‑scale features in the encoder, optimizing the balance between efficiency and effectiveness. The auxiliary routes are discarded during inference. We conduct extensive experiments across various object detection baselines, achieving consistent improvements as demonstrated in Fig. 1. Our method is highly flexible and can be readily adapted to other tasks. To demonstrate its versatility, we conduct experiments on both instance segmentation and panoptic segmentation, further validating its effectiveness. Project page: https://visual‑ai.github.io/mrdetr/

Abstract:
Recent advances in multimodal large language models (MLLMs) have expanded research in video understanding, primarily focusing on high‑level tasks such as video captioning and question‑answering. Meanwhile, a smaller body of work addresses dense, pixel‑precise segmentation tasks, which typically involve category‑guided or referral‑based object segmentation. Although both directions are essential for developing models with human‑level video comprehension, they have largely evolved separately, with distinct benchmarks and architectures. This paper aims to unify these efforts by introducing ViCaS, a new dataset containing thousands of challenging videos, each annotated with detailed, human‑written captions and temporally consistent, pixel‑accurate masks for multiple objects with phrase grounding. Our benchmark evaluates models on both holistic/high‑level understanding and language‑guided, pixel‑precise segmentation. We also present carefully validated evaluation measures and propose an effective model architecture that can tackle our benchmark. Project page: https://ali2500.github.io/vicas‑project/

Abstract:
Face parsing refers to the semantic segmentation of human faces into key facial regions such as eyes, nose, hair, etc. It serves as a prerequisite for various advanced applications, including face editing, face swapping, and facial makeup, which often require segmentation masks for classes like eyeglasses, hats, earrings, and necklaces. These infrequently occurring classes are called long‑tail classes, which are overshadowed by more frequently occurring classes known as head classes. Existing methods, primarily CNN‑based, tend to be dominated by head classes during training, resulting in suboptimal representation for long‑tail classes. Previous works have largely overlooked the problem of poor segmentation performance of long‑tail classes. To address this issue, we propose SegFace, a simple and efficient approach that uses a lightweight transformer‑based model which utilizes learnable class‑specific tokens. The transformer decoder leverages class‑specific tokens, allowing each token to focus on its corresponding class, thereby enabling independent modeling of each class. The proposed approach improves the performance of long‑tail classes, thereby boosting overall performance. To the best of our knowledge, SegFace is the first work to employ transformer models for face parsing. Moreover, our approach can be adapted for low‑compute edge devices, achieving 95.96 FPS. We conduct extensive experiments demonstrating that SegFace significantly outperforms previous state‑of‑the‑art models, achieving a mean F1 score of 88.96 (+2.82) on the CelebAMask‑HQ dataset and 93.03 (+0.65) on the LaPa dataset. Code: https://github.com/Kartik‑3004/SegFace

Abstract:
Generating detailed captions comprehending text‑rich visual content in images has received growing attention for Large Vision‑Language Models (LVLMs). However, few studies have developed benchmarks specifically tailored for detailed captions to measure their accuracy and comprehensiveness. In this paper, we introduce a detailed caption benchmark, termed as CompreCap, to evaluate the visual context from a directed scene graph view. Concretely, we first manually segment the image into semantically meaningful regions (i.e., semantic segmentation mask) according to common‑object vocabulary, while also distinguishing attributes of objects within all those regions. Then directional relation labels of these objects are annotated to compose a directed scene graph that can well encode rich compositional information of the image. Based on our directed scene graph, we develop a pipeline to assess the generated detailed captions from LVLMs on multiple levels, including the object‑level coverage, the accuracy of attribute descriptions, the score of key relationships, etc. Experimental results on the CompreCap dataset confirm that our evaluation method aligns closely with human evaluation scores across LVLMs.

Abstract:
Camera‑based 3D Semantic Occupancy Prediction (SOP) is crucial for understanding complex 3D scenes from limited 2D image observations. Existing SOP methods typically aggregate contextual features to assist the occupancy representation learning, alleviating issues like occlusion or ambiguity. However, these solutions often face misalignment issues wherein the corresponding features at the same position across different frames may have different semantic meanings during the aggregation process, which leads to unreliable contextual fusion results and an unstable representation learning process. To address this problem, we introduce a new Hierarchical context alignment paradigm for more accurate SOP (Hi‑SOP). Hi‑SOP first disentangles the geometric and temporal contexts for separate alignment; the two branches are then composed to enhance the reliability of SOP. This parsing of the visual input into a local‑global alignment hierarchy includes: (I) disentangled, separate geometric and temporal alignment, which leverage depth confidence and camera pose, respectively, as priors for relevant feature matching; (II) global alignment and composition of the transformed geometric and temporal volumes based on semantic consistency. Our method outperforms state‑of‑the‑art methods for semantic scene completion on the SemanticKITTI and NuScenes‑Occupancy datasets and for LiDAR semantic segmentation on the NuScenes dataset. The project website is available at https://arlo0o.github.io/hisop.github.io/.

Abstract:
In this work, we focus on semi‑supervised learning for video action detection. Video action detection requires spatiotemporal localization in addition to classification, and a limited amount of labels makes the model prone to unreliable predictions. We present Stable Mean Teacher, a simple end‑to‑end teacher‑based framework that benefits from improved and temporally consistent pseudo labels. It relies on a novel Error Recovery (EoR) module, which learns from students' mistakes on labeled samples and transfers this knowledge to the teacher to improve pseudo labels for unlabeled samples. Moreover, existing spatiotemporal losses do not take temporal coherency into account and are prone to temporal inconsistencies. To address this, we present Difference of Pixels (DoP), a simple and novel constraint focused on temporal consistency, leading to coherent temporal detections. We evaluate our approach on four different spatiotemporal detection benchmarks: UCF101‑24, JHMDB21, AVA, and YouTube‑VOS. Our approach outperforms the supervised baselines for action detection by an average margin of 23.5% on UCF101‑24, 16% on JHMDB21, and 3.3% on AVA. Using merely 10% and 20% of data, it provides competitive performance compared to the supervised baseline trained on 100% annotations on UCF101‑24 and JHMDB21, respectively. We further evaluate its effectiveness on AVA for scaling to large‑scale datasets and YouTube‑VOS for video object segmentation, demonstrating its generalization capability to other tasks in the video domain. Code and models are publicly available.

Abstract:
How can we enable models to comprehend video anomalies occurring over varying temporal scales and contexts? Traditional Video Anomaly Understanding (VAU) methods focus on frame‑level anomaly prediction, often missing the interpretability of complex and diverse real‑world anomalies. Recent multimodal approaches leverage visual and textual data but lack hierarchical annotations that capture both short‑term and long‑term anomalies. To address this challenge, we introduce HIVAU‑70k, a large‑scale benchmark for hierarchical video anomaly understanding across any granularity. We develop a semi‑automated annotation engine that efficiently scales high‑quality annotations by combining manual video segmentation with recursive free‑text annotation using large language models (LLMs). This results in over 70,000 multi‑granular annotations organized at clip‑level, event‑level, and video‑level segments. For efficient anomaly detection in long videos, we propose the Anomaly‑focused Temporal Sampler (ATS). ATS integrates an anomaly scorer with a density‑aware sampler to adaptively select frames based on anomaly scores, ensuring that the multimodal LLM concentrates on anomaly‑rich regions, which significantly enhances both efficiency and accuracy. Extensive experiments demonstrate that our hierarchical instruction data markedly improves anomaly comprehension. The integrated ATS and visual‑language model outperform traditional methods in processing long videos. Our benchmark and model are publicly available at https://github.com/pipixin321/HolmesVAU.
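
Illustrative sketch (not the authors' implementation): the density-aware sampling idea behind ATS can be pictured as inverse-CDF sampling over per-frame anomaly scores. The code below assumes the scores are already available, treats them (plus a small floor) as a density, and picks frames at evenly spaced quantiles of the cumulative density. The function name `density_aware_sample` and the `floor` parameter are hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate


def density_aware_sample(scores, num_frames, floor=0.05):
    """Pick frame indices so that high-score regions are sampled more densely.

    `floor` is a hypothetical parameter that keeps some coverage of
    low-score regions; the actual ATS design may differ.
    """
    if not scores or num_frames <= 0:
        return []
    density = [max(s, 0.0) + floor for s in scores]
    cdf = list(accumulate(density))
    total = cdf[-1]
    picked = []
    for k in range(num_frames):
        # Midpoint of the k-th equal-mass bin of the cumulative density.
        target = (k + 0.5) / num_frames * total
        idx = min(bisect_left(cdf, target), len(scores) - 1)
        if not picked or idx != picked[-1]:  # drop consecutive duplicates
            picked.append(idx)
    return picked


# Toy video: 25 frames with an anomaly in frames 10-14.
scores = [0.0] * 10 + [1.0] * 5 + [0.0] * 10
frames = density_aware_sample(scores, num_frames=8)
```

With these toy scores, most of the selected indices fall inside the anomalous segment, while the floor still leaves a few samples in the normal parts of the video.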

Abstract:
Remote Sensing Vision‑Language Models (RS VLMs) have made much progress in remote sensing (RS) image comprehension tasks. While performing well in multi‑modal reasoning and multi‑turn conversations, existing models lack pixel‑level understanding and struggle with multi‑image inputs. In this work, we propose RSUniVLM, a unified, end‑to‑end RS VLM designed for comprehensive vision understanding across multiple granularities, including image‑level, region‑level, and pixel‑level tasks. RSUniVLM also performs effectively in multi‑image analysis, such as change detection and change captioning. To enhance the model's ability to capture visual information at different levels without increasing model size, we design a novel architecture called Granularity‑oriented Mixture of Experts, which constrains the model to about 1 billion parameters. We also construct a large‑scale RS instruction‑following dataset based on a variety of existing datasets in both the RS and general domains, encompassing various tasks such as object localization, visual question answering, and semantic segmentation. Extensive experiments validate that the proposed RSUniVLM reaches state‑of‑the‑art performance across various RS tasks. Code and model will be available at https://github.com/xuliu‑cyber/RSUniVLM.

Abstract:
Video colour editing is a crucial task for content creation, yet existing solutions either require painstaking frame‑by‑frame manipulation or produce unrealistic results with temporal artefacts. We present a practical, training‑free framework that makes precise video colour editing accessible through an intuitive interface while maintaining professional‑quality output. Our key insight is that by decoupling spatial and temporal aspects of colour editing, we can better align with users' natural workflow ‑‑ allowing them to focus on precise colour selection in key frames before automatically propagating changes across time. We achieve this through a novel technical framework that combines: (i) a simple point‑and‑click interface merging grid‑based colour selection with automatic instance segmentation for precise spatial control, (ii) bidirectional colour propagation that leverages inherent video motion patterns, and (iii) motion‑aware blending that ensures smooth transitions even with complex object movements. Through extensive evaluation on diverse scenarios, we demonstrate that our approach matches or exceeds state‑of‑the‑art methods while eliminating the need for training or specialized hardware, making professional‑quality video colour editing accessible to everyone.

Abstract:
How well are unimodal vision and language models aligned? Although prior work has approached this question, its assessment methods do not directly translate to how these models are used in practical vision‑language tasks. In this paper, we propose a direct assessment method, inspired by linear probing, to assess vision‑language alignment. We identify that the degree of alignment of SSL vision models depends on their SSL training objective, and we find that the clustering quality of SSL representations has a stronger impact on alignment performance than their linear separability. Next, we introduce Swift Alignment of Image and Language (SAIL), an efficient transfer learning framework that aligns pretrained unimodal vision and language models for downstream vision‑language tasks. Since SAIL leverages the strengths of pretrained unimodal models, it requires significantly less paired image‑text data (6%) for the multimodal alignment compared to models like CLIP, which are trained from scratch. SAIL training requires only a single A100 GPU and 5 hours of training, and can accommodate a batch size of up to 32,768. SAIL achieves 73.4% zero‑shot accuracy on ImageNet (vs. CLIP's 72.7%) and excels in zero‑shot retrieval, complex reasoning, and semantic segmentation. Additionally, SAIL improves the language‑compatibility of vision encoders, which in turn enhances the performance of multimodal large language models. The entire codebase and model weights are open‑source: https://lezhang7.github.io/sail.github.io/

Abstract:
In this paper, we address the challenge of performing open‑vocabulary video instance segmentation (OV‑VIS) in real time. We analyze the computational bottlenecks of state‑of‑the‑art foundation models that perform OV‑VIS, and propose a new method, TROY‑VIS, that significantly improves processing speed while maintaining high accuracy. We introduce three key techniques: (1) Decoupled Attention Feature Enhancer to speed up information interaction between different modalities and scales; (2) Flash Embedding Memory for obtaining fast text embeddings of object categories; and (3) Kernel Interpolation for exploiting the temporal continuity in videos. Our experiments demonstrate that TROY‑VIS achieves the best trade‑off between accuracy and speed on two large‑scale OV‑VIS benchmarks, BURST and LV‑VIS, running 20x faster than GLEE‑Lite (25 FPS vs. 1.25 FPS) with comparable or even better accuracy. These results demonstrate TROY‑VIS's potential for real‑time applications in dynamic environments such as mobile robotics and augmented reality. Code and model will be released at https://github.com/google‑research/troyvis.

Abstract:
Domain generalization (DG) aims to adapt a model using one or multiple source domains to ensure robust performance in unseen target domains. Recently, Parameter‑Efficient Fine‑Tuning (PEFT) of foundation models has shown promising results in the context of the DG problem. Nevertheless, existing PEFT methods still struggle to strike a balance between preserving generalizable components of the pre‑trained model and learning task‑specific features. To gain insights into the distribution of generalizable components, we begin by analyzing the pre‑trained weights through the lens of singular value decomposition. Building on these insights, we introduce Singular Value Decomposed Minor Components Adaptation (SoMA), an approach that selectively tunes minor singular components while keeping the residual parts frozen. SoMA effectively retains the generalization ability of the pre‑trained model while efficiently acquiring task‑specific skills. Moreover, we freeze domain‑generalizable blocks and employ an annealing weight decay strategy, thereby achieving an optimal balance in the delicate trade‑off between generalizability and discriminability. SoMA attains state‑of‑the‑art results on multiple benchmarks that span domain generalized semantic segmentation and domain generalized object detection. In addition, our method introduces no additional inference overhead or regularization loss, maintains compatibility with any backbone or head, and is designed to be versatile, allowing easy integration into a wide range of tasks.

Abstract:
Automated crop mapping through Satellite Image Time Series (SITS) has emerged as a crucial avenue for agricultural monitoring and management. However, due to the low resolution and unclear parcel boundaries, annotating pixel‑level masks is exceptionally complex and time‑consuming in SITS. This paper embraces the weakly supervised paradigm (i.e., only image‑level categories available) to liberate the crop mapping task from the exhaustive annotation burden. The unique characteristics of SITS give rise to several challenges in weakly supervised learning: (1) noise perturbation from spatially neighboring regions, and (2) erroneous semantic bias from anomalous temporal periods. To address the above difficulties, we propose a novel method, termed exploring space‑time perceptive clues (Exact). First, we introduce a set of spatial clues to explicitly capture the representative patterns of different crops from the most class‑relative regions. In addition, we leverage the temporal‑to‑class interaction of the model to emphasize the contributions of pivotal clips, thereby enhancing the model's perception of crop regions. Building upon the space‑time perceptive clues, we derive clue‑based CAMs to effectively supervise the SITS segmentation network. Our method demonstrates impressive performance on various SITS benchmarks. Remarkably, the segmentation network trained on Exact‑generated masks achieves 95% of its fully supervised performance, showing the bright promise of the weakly supervised paradigm in crop mapping scenarios. Our code will be publicly available.

Abstract:
As the deployment of artificial intelligence (AI) algorithms on edge devices becomes increasingly prevalent, enhancing the robustness and reliability of autonomous AI‑based perception and decision systems is becoming as relevant as precision and performance, especially in application areas considered safety‑critical such as autonomous driving and aerospace. This paper delves into the robustness assessment in embedded Deep Neural Networks (DNNs), particularly focusing on the impact of parameter perturbations produced by single event upsets (SEUs) on convolutional neural networks (CNNs) for image semantic segmentation. By scrutinizing the layer‑by‑layer and bit‑by‑bit sensitivity of various encoder‑decoder models to soft errors, this study thoroughly investigates the vulnerability of segmentation DNNs to SEUs and evaluates the consequences of techniques like model pruning and parameter quantization on the robustness of compressed models aimed at embedded implementations. The findings offer valuable insights into the mechanisms underlying SEU‑induced failures that allow for evaluating the robustness of DNNs once trained in advance. Moreover, based on the collected data, we propose a set of practical lightweight error mitigation techniques with no memory or computational cost suitable for resource‑constrained deployments. The code used to perform the fault injection (FI) campaign is available at https://github.com/jonGuti13/TensorFI2 , while the code to implement proposed techniques is available at https://github.com/jonGuti13/parameterProtection .
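
Illustrative sketch (independent of the paper's TensorFI2-based tooling): a single event upset can be modeled as flipping one bit of a stored parameter. The example below assumes a weight stored as an IEEE 754 float32 and uses a hypothetical helper `flip_bit` to show why bit position matters: a flip in the sign or high exponent bit changes the value drastically, while a flip in the lowest mantissa bit is nearly invisible.

```python
import struct


def flip_bit(value, bit):
    """Return `value` stored as float32 with one bit flipped.

    Bit 0 is the least significant mantissa bit, bits 23-30 are the
    exponent, and bit 31 is the sign.
    """
    if not 0 <= bit <= 31:
        raise ValueError("bit must be in [0, 31]")
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped


weight = 0.5
sign_flip = flip_bit(weight, 31)      # sign bit: 0.5 becomes -0.5
exponent_flip = flip_bit(weight, 30)  # top exponent bit: 0.5 becomes 2**127
mantissa_flip = flip_bit(weight, 0)   # lowest mantissa bit: tiny perturbation
```

Flipping the same bit twice restores the original value, which makes the helper convenient for injecting and then undoing a fault in a test loop.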

Abstract:
CLIP has shown impressive results in aligning images and texts at scale. However, its ability to capture detailed visual features remains limited because CLIP matches images and texts at a global level. To address this issue, we propose FLAIR, Fine‑grained Language‑informed Image Representations, an approach that utilizes long and detailed image descriptions to learn localized image embeddings. By sampling diverse sub‑captions that describe fine‑grained details about an image, we train our vision‑language model to produce not only global embeddings but also text‑specific image representations. Our model introduces text‑conditioned attention pooling on top of local image tokens to produce fine‑grained image representations that excel at retrieving detailed image content. We achieve state‑of‑the‑art performance on both existing multimodal retrieval benchmarks and our newly introduced fine‑grained retrieval task, which evaluates vision‑language models' ability to retrieve partial image content. Furthermore, our experiments demonstrate the effectiveness of FLAIR trained on 30M image‑text pairs in capturing fine‑grained visual information, including zero‑shot semantic segmentation, outperforming models trained on billions of pairs. Code is available at https://github.com/ExplainableML/flair .
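
Illustrative sketch (a simplification, not FLAIR's actual module): text-conditioned attention pooling can be pictured as softmax attention in which the text embedding is the query and the local image tokens are both keys and values. The code below assumes a single head, no learned projections, and tokens with the same dimension as the query. The function name `text_conditioned_pool` is hypothetical.

```python
import math


def text_conditioned_pool(text_query, local_tokens):
    """Pool local image tokens with softmax attention driven by a text query."""
    dim = len(text_query)
    # Scaled dot-product similarity between the text query and each token.
    logits = [
        sum(q * t for q, t in zip(text_query, token)) / math.sqrt(dim)
        for token in local_tokens
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    # Weighted sum of tokens: a text-specific image representation.
    pooled = [
        sum(w * token[d] for w, token in zip(weights, local_tokens))
        for d in range(dim)
    ]
    return pooled, weights


tokens = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
pooled, weights = text_conditioned_pool([4.0, 0.0], tokens)
```

In this toy case the first token matches the query, so it receives nearly all of the attention weight and dominates the pooled representation.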

Abstract:
We propose a novel bio‑inspired semi‑supervised learning approach for training downsampling‑upsampling semantic segmentation architectures. The first stage does not use backpropagation. Rather, it exploits the Hebbian principle "fire together, wire together" as a local learning rule for updating the weights of both convolutional and transpose‑convolutional layers, allowing unsupervised discovery of data features. In the second stage, the model is fine‑tuned with standard backpropagation on a small subset of labeled data. We evaluate our methodology through experiments conducted on several widely used biomedical datasets, given that this domain is paramount in computer vision and is notably impacted by data scarcity. Results show that our proposed method outperforms SOTA approaches across different levels of label availability. Furthermore, we show that using our unsupervised stage to initialize the SOTA approaches leads to performance improvements. The code to replicate our experiments can be found at https://github.com/ciampluca/hebbian‑bootstraping‑semi‑supervised‑medical‑imaging
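
Illustrative sketch (the textbook rule, not necessarily the variant used in the paper): the "fire together, wire together" principle can be written as a local update in which each weight grows in proportion to the product of its input and its output activation, with no backpropagated error. The code below applies one plain Hebbian step to a small linear layer. The function name `hebbian_step` and the learning rate are hypothetical, and practical variants usually add normalization or competition to keep weights bounded.

```python
def hebbian_step(weights, x, lr=0.1):
    """One plain Hebbian update for a linear layer: dW[i][j] = lr * y[i] * x[j]."""
    # Forward pass: each output is the dot product of a weight row with the input.
    y = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in weights]
    # Local update: uses only the input and the neuron's own output.
    updated = [
        [w_ij + lr * y_i * x_j for w_ij, x_j in zip(row, x)]
        for row, y_i in zip(weights, y)
    ]
    return updated, y


weights = [[0.5, 0.5], [0.5, -0.5]]
x = [1.0, 1.0]
new_weights, activations = hebbian_step(weights, x)
```

Here the first neuron responds to the input and its weights are strengthened, while the second neuron stays silent and its weights are unchanged.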

Abstract:
Due to their large sizes, volumetric scans and whole‑slide pathology images (WSIs) are often processed by extracting embeddings from local regions and then an aggregator makes predictions from this set. However, current methods require post‑hoc visualization techniques (e.g., Grad‑CAM) and often fail to localize small yet clinically crucial details. To address these limitations, we introduce INSIGHT, a novel weakly‑supervised aggregator that integrates heatmap generation as an inductive bias. Starting from pre‑trained feature maps, INSIGHT employs a detection module with small convolutional kernels to capture fine details and a context module with a broader receptive field to suppress local false positives. The resulting internal heatmap highlights diagnostically relevant regions. On CT and WSI benchmarks, INSIGHT achieves state‑of‑the‑art classification results and high weakly‑labeled semantic segmentation performance. Project website and code are available at: https://zhangdylan83.github.io/ewsmia/

Abstract:
Vision‑Language Models (VLMs) trained with contrastive loss have achieved significant advancements in various vision and language tasks. However, the global nature of the contrastive loss makes VLMs focus predominantly on foreground objects, neglecting other crucial information in the image, which limits their effectiveness in downstream tasks. To address these challenges, we propose COSMOS: CrOSs‑MOdality Self‑distillation for vision‑language pre‑training that integrates a novel text‑cropping strategy and cross‑attention module into a self‑supervised learning framework. We create global and local views of images and texts (i.e., multi‑modal augmentations), which are essential for self‑distillation in VLMs. We further introduce a cross‑attention module, enabling COSMOS to learn comprehensive cross‑modal representations optimized via a cross‑modality self‑distillation loss. COSMOS consistently outperforms previous strong baselines on various zero‑shot downstream tasks, including retrieval, classification, and semantic segmentation. Additionally, it surpasses CLIP‑based models trained on larger datasets in visual perception and contextual understanding tasks. Code is available at https://github.com/ExplainableML/cosmos.

Abstract:
The creation of 3D scenes has traditionally been both labor‑intensive and costly, requiring designers to meticulously configure 3D assets and environments. Recent advancements in generative AI, including text‑to‑3D and image‑to‑3D methods, have dramatically reduced the complexity and cost of this process. However, current techniques for editing complex 3D scenes continue to rely on generally interactive multi‑step, 2D‑to‑3D projection methods and diffusion‑based techniques, which often lack precision in control and hamper interactive‑rate performance. In this work, we propose 3DSceneEditor, a fully 3D‑based paradigm for interactive‑rate, precise editing of intricate 3D scenes using Gaussian Splatting. Unlike conventional methods, 3DSceneEditor operates through a streamlined 3D pipeline, enabling direct Gaussian‑based manipulation for efficient, high‑quality edits based on input prompts. The proposed framework (i) integrates a pre‑trained instance segmentation model for semantic labeling; (ii) employs a zero‑shot grounding approach with CLIP to align target objects with user prompts; and (iii) applies scene modifications, such as object addition, repositioning, recoloring, replacement, and removal, directly on Gaussians. Extensive experimental results show that 3DSceneEditor surpasses existing state‑of‑the‑art techniques in terms of both editing precision and efficiency, establishing a new benchmark for efficient and interactive 3D scene customization.

Abstract:
Human pose estimation methods work well on isolated people but struggle with multiple‑bodies‑in‑proximity scenarios. Previous work has addressed this problem by conditioning pose estimation on detected bounding boxes or keypoints, but has overlooked instance masks. We propose to iteratively enforce mutual consistency of bounding boxes, instance masks, and poses. The introduced BBox‑Mask‑Pose (BMP) method uses three specialized models that improve each other's output in a closed loop. All models are adapted for mutual conditioning, which improves robustness in multi‑body scenes. MaskPose, a new mask‑conditioned pose estimation model, is the best among top‑down approaches on OCHuman. BBox‑Mask‑Pose pushes the SOTA on the OCHuman dataset in all three tasks: detection, instance segmentation, and pose estimation. It also achieves SOTA performance on COCO pose estimation. The method is especially good in scenes with large instance overlap, where it improves detection by 39% over the baseline detector. With small specialized models and faster runtime, BMP is an effective alternative to large human‑centered foundational models. Code and models are available at https://MiraPurkrabek.github.io/BBox‑Mask‑Pose.

Abstract:
Current benchmarks for video segmentation are limited to annotating only salient objects (i.e., foreground instances). Despite their impressive architectural designs, previous works trained on these benchmarks have struggled to adapt to real‑world scenarios. Thus, developing a new video segmentation dataset aimed at tracking multi‑granularity segmentation targets in the video scene is necessary. In this work, we aim to generate a multi‑granularity video segmentation dataset that is annotated with both salient and non‑salient masks. To achieve this, we propose a large‑scale, densely annotated multi‑granularity video object segmentation (MUG‑VOS) dataset that includes various types and granularities of mask annotations. We automatically collected a training set that assists in tracking both salient and non‑salient objects, and we also curated a human‑annotated test set for reliable evaluation. In addition, we present a memory‑based mask propagation model (MMPM), trained and evaluated on the MUG‑VOS dataset, which achieves the best performance among existing video object segmentation methods and SAM‑based video segmentation methods. Project page is available at https://cvlab‑kaist.github.io/MUG‑VOS.

Abstract:
Handling occlusion remains a significant challenge for video instance‑level tasks like Multiple Object Tracking (MOT) and Video Instance Segmentation (VIS). In this paper, we propose a novel framework, Amodal‑Aware Video Instance Segmentation (A2VIS), which incorporates amodal representations to achieve a reliable and comprehensive understanding of both visible and occluded parts of objects in a video. The key intuition is that awareness of amodal segmentation through spatiotemporal dimension enables a stable stream of object information. In scenarios where objects are partially or completely hidden from view, amodal segmentation offers more consistency and less dramatic changes along the temporal axis compared to visible segmentation. Hence, both amodal and visible information from all clips can be integrated into one global instance prototype. To effectively address the challenge of video amodal segmentation, we introduce the spatiotemporal‑prior Amodal Mask Head, which leverages visible information intra clips while extracting amodal characteristics inter clips. Through extensive experiments and ablation studies, we show that A2VIS excels in both MOT and VIS tasks in identifying and tracking object instances with a keen understanding of their full shape.

Abstract:
Referring video object segmentation (RVOS) requires tracking and segmenting an object throughout a video according to a given natural language expression, demanding both complex motion understanding and the alignment of visual representations with language descriptions. Given these challenges, the recently proposed Segment Anything Model 2 (SAM2) emerges as a potential candidate due to its ability to generate coherent segmentation mask tracks across video frames and to provide an inherent spatio‑temporal objectness in its object token representations. In this paper, we introduce SOLA (Selection by Object Language Alignment), a novel framework that leverages SAM2 object tokens as compact video‑level object representations, which are aligned with language features through a lightweight track selection module. To effectively facilitate this alignment, we propose an IoU‑based pseudo‑labeling strategy, which bridges the modality gap between SAM2 representations and language features. Extensive experiments show that SOLA achieves state‑of‑the‑art performance on the MeViS dataset, demonstrating that SOLA offers an effective solution for RVOS. Our project page is available at: https://cvlab‑kaist.github.io/SOLA.
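
Illustrative sketch (an assumption about the general idea, not SOLA's exact procedure): an IoU-based pseudo-labeling step can be pictured as scoring each candidate mask track against the annotated target and marking tracks with enough overlap as positives for the selection module. The code below represents masks as sets of (frame, y, x) pixels. The function names and the 0.5 threshold are hypothetical.

```python
def mask_iou(mask_a, mask_b):
    """IoU between two masks given as sets of (frame, y, x) pixel coordinates."""
    union = len(mask_a | mask_b)
    return len(mask_a & mask_b) / union if union else 0.0


def pseudo_labels(track_masks, target_mask, threshold=0.5):
    """Label each candidate track 1 if its IoU with the target reaches the threshold, else 0."""
    return [1 if mask_iou(mask, target_mask) >= threshold else 0 for mask in track_masks]


# Toy example: a 2x2 target in frame 0 and two candidate tracks.
target = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)}
track_a = {(0, 0, 0), (0, 0, 1), (0, 1, 0)}  # covers 3 of the 4 target pixels
track_b = {(0, 5, 5)}                        # unrelated object
labels = pseudo_labels([track_a, track_b], target)
```

The resulting binary labels could then supervise a classifier that scores each track against the language expression.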

Abstract:
Recent DETR‑based methods have advanced the development of Video Instance Segmentation (VIS) through transformers' efficiency and capability in modeling spatial and temporal information. Despite remarkable progress, existing works follow asynchronous designs, which model video sequences either via video‑level queries only or via query‑sensitive cascade structures, resulting in difficulties when handling complex and challenging video scenarios. In this work, we analyze the cause of this phenomenon and the limitations of the current solutions, and propose to conduct synchronized modeling via a new framework named SyncVIS. Specifically, SyncVIS explicitly introduces video‑level query embeddings and designs two key modules to synchronize video‑level query embeddings with frame‑level query embeddings: a synchronized video‑frame modeling paradigm and a synchronized embedding optimization strategy. The former promotes mutual learning between frame‑ and video‑level embeddings, and the latter divides large video sequences into small clips for easier optimization. Extensive experimental evaluations are conducted on the challenging YouTube‑VIS 2019, 2021, and 2022 and OVIS benchmarks, where SyncVIS achieves state‑of‑the‑art results, demonstrating the effectiveness and generality of the proposed approach. The code is available at https://github.com/rkzheng99/SyncVIS.

Abstract:
Efficiently modeling large 2D contexts is essential for various fields including Giga‑Pixel Whole Slide Imaging (WSI) and remote sensing. Transformer‑based models offer high parallelism but face challenges due to their quadratic complexity for handling long sequences. Recently, Mamba introduced a selective State Space Model (SSM) with linear complexity and high parallelism, enabling effective and efficient modeling of wide context in 1D sequences. However, extending Mamba to vision tasks, which inherently involve 2D structures, results in spatial discrepancies due to the limitations of 1D sequence processing. On the other hand, current 2D SSMs inherently model 2D structures but they suffer from prohibitively slow computation due to the lack of efficient parallel algorithms. In this work, we propose 2DMamba, a novel 2D selective SSM framework that incorporates the 2D spatial structure of images into Mamba, with a highly optimized hardware‑aware operator, adopting both spatial continuity and computational efficiency. We validate the versatility of our approach on both WSIs and natural images. Extensive experiments on 10 public datasets for WSI classification and survival analysis show that 2DMamba improves up to 2.48% in AUC, 3.11% in F1 score, 2.47% in accuracy and 5.52% in C‑index. Additionally, integrating our method with VMamba for natural imaging yields 0.5 to 0.7 improvements in mIoU on the ADE20k semantic segmentation dataset, and 0.2% accuracy improvement on ImageNet‑1K classification dataset. Our code is available at https://github.com/AtlasAnalyticsLab/2DMamba.

Abstract:
We propose TAROT, a targeted data selection framework grounded in optimal transport theory. Previous targeted data selection methods primarily rely on influence‑based greedy heuristics to enhance domain‑specific performance. While effective on limited, unimodal data (i.e., data following a single pattern), these methods struggle as target data complexity increases. Specifically, in multimodal distributions, these heuristics fail to account for multiple inherent patterns, leading to suboptimal data selection. This work identifies two primary factors contributing to this limitation: (i) the disproportionate impact of dominant feature components in high‑dimensional influence estimation, and (ii) the restrictive linear additive assumptions inherent in greedy selection strategies. To address these challenges, TAROT incorporates whitened feature distance to mitigate dominant feature bias, providing a more reliable measure of data influence. Building on this, TAROT uses whitened feature distance to quantify and minimize the optimal transport distance between the selected data and target domains. Notably, this minimization also facilitates the estimation of optimal selection ratios. We evaluate TAROT across multiple tasks, including semantic segmentation, motion prediction, and instruction tuning. Results consistently show that TAROT outperforms state‑of‑the‑art methods, highlighting its versatility across various deep learning tasks. Code is available at https://github.com/vita‑epfl/TAROT.

Abstract:
Feature upsampling is an essential operation in constructing deep convolutional neural networks. However, existing upsamplers either lack specific feature guidance or necessitate the utilization of high‑resolution feature maps, resulting in a loss of performance and flexibility. In this paper, we find that the local self‑attention naturally has the feature guidance capability, and its computational paradigm aligns closely with the essence of feature upsampling (i.e., feature reassembly of neighboring points). Therefore, we introduce local self‑attention into the upsampling task and demonstrate that the majority of existing upsamplers can be regarded as special cases of upsamplers based on local self‑attention. Considering the potential semantic gap between upsampled points and their neighboring points, we further introduce the deformation mechanism into the upsampler based on local self‑attention, thereby proposing LDA‑AQU. As a novel dynamic kernel‑based upsampler, LDA‑AQU utilizes the feature of queries to guide the model in adaptively adjusting the position and aggregation weight of neighboring points, thereby meeting the upsampling requirements across various complex scenarios. In addition, LDA‑AQU is lightweight and can be easily integrated into various model architectures. We evaluate the effectiveness of LDA‑AQU across four dense prediction tasks: object detection, instance segmentation, panoptic segmentation, and semantic segmentation. LDA‑AQU consistently outperforms previous state‑of‑the‑art upsamplers, achieving performance enhancements of 1.7 AP, 1.5 AP, 2.0 PQ, and 2.5 mIoU compared to the baseline models in the aforementioned four tasks, respectively. Code is available at https://github.com/duzw9311/LDA‑AQU.

Abstract:
Injecting semantics into 3D Gaussian Splatting (3DGS) has recently garnered significant attention. While current approaches typically distill 3D semantic features from 2D foundational models (e.g., CLIP and SAM) to facilitate novel view segmentation and semantic understanding, their heavy reliance on 2D supervision can undermine cross‑view semantic consistency and necessitate complex data preparation processes, therefore hindering view‑consistent scene understanding. In this work, we present FreeGS, an unsupervised semantic‑embedded 3DGS framework that achieves view‑consistent 3D scene understanding without the need for 2D labels. Instead of directly learning semantic features, we introduce the IDentity‑coupled Semantic Field (IDSF) into 3DGS, which captures both semantic representations and view‑consistent instance indices for each Gaussian. We optimize IDSF with a two‑step alternating strategy: semantics help to extract coherent instances in 3D space, while the resulting instances regularize the injection of stable semantics from 2D space. Additionally, we adopt a 2D‑3D joint contrastive loss to enhance the complementarity between view‑consistent 3D geometry and rich semantics during the bootstrapping process, enabling FreeGS to uniformly perform tasks such as novel‑view semantic segmentation, object selection, and 3D object detection. Extensive experiments on LERF‑Mask, 3D‑OVS, and ScanNet datasets demonstrate that FreeGS performs comparably to state‑of‑the‑art methods while avoiding the complex data preprocessing workload. Our code is publicly available at https://github.com/wb014/FreeGS.

Abstract:
3D scene understanding has become an essential area of research with applications in autonomous driving, robotics, and augmented reality. Recently, 3D Gaussian Splatting (3DGS) has emerged as a powerful approach, combining explicit modeling with neural adaptability to provide efficient and detailed scene representations. However, three major challenges remain in leveraging 3DGS for scene understanding: 1) an imbalance between appearance and semantics, where dense Gaussian usage for fine‑grained texture modeling does not align with the minimal requirements for semantic attributes; 2) inconsistencies between appearance and semantics, as purely appearance‑based Gaussians often misrepresent object boundaries; and 3) reliance on top‑down instance segmentation methods, which struggle with uneven category distributions, leading to over‑ or under‑segmentation. In this work, we propose InstanceGaussian, a method that jointly learns appearance and semantic features while adaptively aggregating instances. Our contributions include: i) a novel Semantic‑Scaffold‑GS representation balancing appearance and semantics to improve feature representations and boundary delineation; ii) a progressive appearance‑semantic joint training strategy to enhance stability and segmentation accuracy; and iii) a bottom‑up, category‑agnostic instance aggregation approach that addresses segmentation challenges through farthest point sampling and connected component analysis. Our approach achieves state‑of‑the‑art performance in category‑agnostic, open‑vocabulary 3D point‑level segmentation, highlighting the effectiveness of the proposed representation and training strategies. Project page: https://lhj‑git.github.io/InstanceGaussian/
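
The abstract above names farthest point sampling as one ingredient of its bottom‑up instance aggregation. The following is a minimal sketch of plain greedy farthest point sampling in pure Python, assuming points are coordinate tuples; the function name and the `start` parameter are illustrative and are not taken from the InstanceGaussian code.

```python
import math

def farthest_point_sampling(points, k, start=0):
    """Return indices of k points chosen greedily to be mutually far apart.

    Hypothetical helper for illustration only; not the authors' implementation.
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    chosen = [start]
    # Distance from every point to its nearest already-chosen point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < k:
        # The next sample is the point farthest from the chosen set.
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        chosen.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return chosen

points = [(0, 0, 0), (1, 0, 0), (10, 0, 0), (10, 1, 0), (5, 0, 0)]
seeds = farthest_point_sampling(points, k=3)
```

Each iteration costs one pass over the points, so selecting k seeds from n points takes O(nk) distance evaluations.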

Abstract:
Segment Anything Model 2 (SAM2) demonstrates exceptional performance in video segmentation and refinement of segmentation results. We anticipate that it can further evolve to achieve higher levels of automation for practical applications. Building upon SAM2, we conducted a series of practices that ultimately led to the development of a fully automated pipeline, termed Det‑SAM2, in which object prompts are automatically generated by a detection model to facilitate inference and refinement by SAM2. This pipeline enables inference on infinitely long video streams with constant VRAM and RAM usage, all while preserving the same efficiency and accuracy as the original SAM2. This technical report focuses on the construction of the overall Det‑SAM2 framework and the subsequent engineering optimization applied to SAM2. We present a case demonstrating an application built on the Det‑SAM2 framework: AI refereeing in a billiards scenario, derived from our business context. The project is available at https://github.com/motern88/Det‑SAM2.

Abstract:
Text‑to‑image diffusion models have emerged as powerful priors for real‑world image super‑resolution (Real‑ISR). However, existing methods may produce unintended results due to noisy text prompts and their lack of spatial information. In this paper, we present HoliSDiP, a framework that leverages semantic segmentation to provide both precise textual and spatial guidance for diffusion‑based Real‑ISR. Our method employs semantic labels as concise text prompts while introducing dense semantic guidance through segmentation masks and our proposed Segmentation‑CLIP Map. Extensive experiments demonstrate that HoliSDiP achieves significant improvement in image quality across various Real‑ISR scenarios through reduced prompt noise and enhanced spatial control.

Abstract:
All‑weather image restoration (AWIR) is crucial for reliable autonomous navigation under adverse weather conditions. AWIR models are trained to address a specific set of weather conditions such as fog, rain, and snow. But this causes them to often struggle with out‑of‑distribution (OoD) samples or unseen degradations which limits their effectiveness for real‑world autonomous navigation. To overcome this issue, existing models must either be retrained or fine‑tuned, both of which are inefficient and impractical, with retraining needing access to large datasets, and fine‑tuning involving many parameters. In this paper, we propose using Low‑Rank Adaptation (LoRA) to efficiently adapt a pre‑trained all‑weather model to novel weather restoration tasks. Furthermore, we observe that LoRA lowers the performance of the adapted model on the pre‑trained restoration tasks. To address this issue, we introduce a LoRA‑based fine‑tuning method called LoRA‑Align (LoRA‑A) which seeks to align the singular vectors of the fine‑tuned and pre‑trained weight matrices using Singular Value Decomposition (SVD). This alignment helps preserve the model's knowledge of its original tasks while adapting it to unseen tasks. We show that images restored with LoRA and LoRA‑A can be effectively used for computer vision tasks in autonomous navigation, such as semantic segmentation and depth estimation.
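
For readers unfamiliar with the low‑rank adaptation mentioned above, the sketch below shows the standard LoRA weight update, W' = W + (alpha / r) · B · A, on small nested‑list matrices. It illustrates generic LoRA only, under the usual formulation; the SVD‑based alignment step of LoRA‑A is not shown, and all names here are illustrative.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_merge(w, b, a, alpha, rank):
    """Merge a low-rank update into a frozen weight matrix.

    w: d_out x d_in frozen weights; b: d_out x rank; a: rank x d_in.
    Generic LoRA formulation, not the LoRA-A method from the abstract.
    """
    scale = alpha / rank
    delta = matmul(b, a)  # rank-limited update with the same shape as w
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

w = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0], [2.0]]   # d_out x rank, with rank = 1
a = [[3.0, 4.0]]     # rank x d_in
merged = lora_merge(w, b, a, alpha=2.0, rank=1)
```

Only `b` and `a` would be trained, which is why the number of tunable parameters scales with the rank rather than with the full weight matrix.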

Abstract:
Referring Video Object Segmentation (RVOS) relies on natural language expressions to segment an object in a video clip. Existing methods either restrict reasoning to independent short clips, losing global context, or process the entire video offline, impairing their application in a streaming fashion. In this work, we aim to surpass these limitations and design an RVOS method capable of effectively operating in streaming‑like scenarios while retaining contextual information from past frames. We build upon the Segment‑Anything 2 (SAM2) model, which provides robust segmentation and tracking capabilities and is naturally suited for streaming processing. We make SAM2 wiser by empowering it with natural language understanding and explicit temporal modeling at the feature extraction stage, without fine‑tuning its weights and without outsourcing modality interaction to external models. To this end, we introduce a novel adapter module that injects temporal information and multi‑modal cues in the feature extraction process. We further reveal the phenomenon of tracking bias in SAM2 and propose a learnable module to adjust its tracking focus when the current frame features suggest a new object more aligned with the caption. Our proposed method, SAMWISE, achieves state‑of‑the‑art results across various benchmarks while adding a negligible overhead of less than 5M parameters. Code is available at https://github.com/ClaudiaCuttano/SAMWISE .

Abstract:
Memory‑based trackers are video object segmentation methods that form the target model by concatenating recently tracked frames into a memory buffer and localize the target by attending the current image to the buffered frames. While already achieving top performance on many benchmarks, it was the recent release of SAM2 that placed memory‑based trackers into focus of the visual object tracking community. Nevertheless, modern trackers still struggle in the presence of distractors. We argue that a more sophisticated memory model is required, and propose a new distractor‑aware memory model for SAM2 and an introspection‑based update strategy that jointly addresses the segmentation accuracy as well as tracking robustness. The resulting tracker is denoted as SAM2.1++. We also propose a new distractor‑distilled DiDi dataset to study the distractor problem better. SAM2.1++ outperforms SAM2.1 and related SAM memory extensions on seven benchmarks and sets a solid new state‑of‑the‑art on six of them.

Abstract:
Object detection and semantic segmentation are both scene understanding tasks yet they differ in data structure and information level. Object detection requires box coordinates for object instances while semantic segmentation requires pixel‑wise class labels. Making use of one task's information to train the other would be beneficial for multi‑task partially supervised learning where each training example is annotated only for a single task, having the potential to expand training sets with different‑task datasets. This paper studies various weak losses for partially annotated data in combination with existing supervised losses. We propose Box‑for‑Mask and Mask‑for‑Box strategies, and their combination BoMBo, to distil necessary information from one task annotations to train the other. Ablation studies and experimental results on VOC and COCO datasets show favorable results for the proposed idea. Source code and data splits can be found at https://github.com/lhoangan/multas.

Abstract:
Mamba has shown great potential for computer vision due to its linear complexity in modeling the global context with respect to the input length. However, existing lightweight Mamba‑based backbones cannot demonstrate performance that matches Convolution or Transformer‑based methods. We observe that simply modifying the scanning path in the image domain is not conducive to fully exploiting the potential of vision Mamba. In this paper, we first perform comprehensive spectral and quantitative analyses, and verify that the Mamba block mainly models low‑frequency information under a Convolution‑Mamba hybrid architecture. Based on the analyses, we introduce a novel Laplace mixer to decouple the features in terms of frequency and input only the low‑frequency components into the Mamba block. In addition, considering the redundancy of the features and the different requirements for high‑frequency details and low‑frequency global information at different stages, we introduce a frequency ramp inception, i.e., gradually reducing the input dimensions of the high‑frequency branches, so as to efficiently trade off the high‑frequency and low‑frequency components at different layers. By integrating mobile‑friendly convolution and the efficient Laplace mixer, we build a series of tiny hybrid vision Mamba models called TinyViM. The proposed TinyViM achieves impressive performance on several downstream tasks including image classification, semantic segmentation, object detection and instance segmentation. In particular, TinyViM outperforms Convolution, Transformer and Mamba‑based models with similar scales, and the throughput is about 2‑3 times higher than that of other Mamba‑based models. Code is available at https://github.com/xwmaxwma/TinyViM.

Abstract:
4D content generation aims to create dynamically evolving 3D content that responds to specific input objects such as images or 3D representations. Current approaches typically incorporate physical priors to animate 3D representations, but these methods suffer from significant limitations: they not only require users lacking physics expertise to manually specify material properties but also struggle to effectively handle the generation of multi‑material composite objects. To address these challenges, we propose Phys4DGen, a novel 4D generation framework that integrates multi‑material composition perception with physical simulation. The framework achieves automated, physically plausible 4D generation through three innovative modules: first, the 3D Material Grouping module partitions heterogeneous material regions on 3D representations' surfaces via semantic segmentation; second, the Internal Physical Structure Discovery module constructs the mechanical structure of object interiors; finally, we distill physical prior knowledge from multimodal large language models to enable rapid and automatic material properties identification for both objects' surfaces and interiors. Experiments on both synthetic and real‑world datasets demonstrate that Phys4DGen can generate high‑fidelity 4D content with physical realism in open‑world scenarios, significantly outperforming state‑of‑the‑art methods.

Abstract:
In recent years, significant progress has been made in collecting large‑scale datasets to improve segmentation and autonomous driving models. These large‑scale datasets are often dominated by common environmental conditions such as "Clear and Day" weather, leading to decreased performance in under‑represented conditions like "Rainy and Night". To address this issue, we introduce SynDiff‑AD, a novel data augmentation pipeline that leverages diffusion models (DMs) to generate realistic images for such subgroups. SynDiff‑AD uses ControlNet, a DM that guides data generation conditioned on semantic maps, along with a novel prompting scheme that generates subgroup‑specific, semantically dense prompts. By augmenting datasets with SynDiff‑AD, we improve the performance of segmentation models like Mask2Former and SegFormer by up to 1.2% and 2.3% on the Waymo dataset, and up to 1.4% and 0.7% on the DeepDrive dataset, respectively. Additionally, we demonstrate that our SynDiff‑AD pipeline enhances the driving performance of end‑to‑end autonomous driving models, like AIM‑2D and AIM‑BEV, by up to 20% across diverse environmental conditions in the CARLA autonomous driving simulator, providing a more robust model. We release our code and pipeline at https://github.com/UTAustin‑SwarmLab/SynDiff‑AD.

Abstract:
Traditionally, algorithms that learn to segment object instances in 2D images have heavily relied on large amounts of human‑annotated data. Only recently, novel approaches have emerged tackling this problem in an unsupervised fashion. Generally, these approaches first generate pseudo‑masks and then train a class‑agnostic detector. While such methods deliver the current state of the art, they often fail to correctly separate instances overlapping in 2D image space since only semantics are considered. To tackle this issue, we instead propose to cut the semantic masks in 3D to obtain the final 2D instances by utilizing a point cloud representation of the scene. Furthermore, we derive a Spatial Importance function, which we use to resharpen the semantics along the 3D borders of instances. Nevertheless, these pseudo‑masks are still subject to mask ambiguity. To address this issue, we further propose to augment the training of a class‑agnostic detector with three Spatial Confidence components aiming to isolate a clean learning signal. With these contributions, our approach outperforms competing methods across multiple standard benchmarks for unsupervised instance segmentation and object detection.

Abstract:
Open‑vocabulary 3D scene understanding is indispensable for embodied agents. Recent works leverage pretrained vision‑language models (VLMs) for object segmentation and project them to point clouds to build 3D maps. Despite progress, a point cloud is a set of unordered coordinates that requires substantial storage space and does not directly convey occupancy information or spatial relation, making existing methods inefficient for downstream tasks, e.g., path planning and text‑based object retrieval. To address these issues, we propose Octree‑Graph, a novel scene representation for open‑vocabulary 3D scene understanding. Specifically, a Chronological Group‑wise Segment Merging (CGSM) strategy and an Instance Feature Aggregation (IFA) algorithm are first designed to get 3D instances and corresponding semantic features. Subsequently, an adaptive‑octree structure is developed that stores semantics and depicts the occupancy of an object adjustably according to its shape. Finally, the Octree‑Graph is constructed where each adaptive‑octree acts as a graph node, and edges describe the spatial relations among nodes. Extensive experiments on various tasks are conducted on several widely‑used datasets, demonstrating the versatility and effectiveness of our method. Code is available at https://github.com/yifeisu/OV‑Octree‑Graph.

Abstract:
Recent breakthroughs in large foundation models have enabled the possibility of transferring knowledge pre‑trained on vast datasets to domains with limited data availability. Agriculture is one of the domains that lacks sufficient data. This study proposes a framework to train effective, domain‑specific, small models from foundation models without manual annotation. Our approach begins with SDM (Segmentation‑Description‑Matching), a stage that leverages two foundation models: SAM2 (Segment Anything in Images and Videos) for segmentation and OpenCLIP (Open Contrastive Language‑Image Pretraining) for zero‑shot open‑vocabulary classification. In the second stage, a novel knowledge distillation mechanism is utilized to distill compact, edge‑deployable models from SDM, enhancing both inference speed and perception accuracy. The complete method, termed SDM‑D (Segmentation‑Description‑Matching‑Distilling), demonstrates strong performance across various fruit detection tasks (object detection, semantic segmentation, and instance segmentation) without manual annotation. It nearly matches the performance of models trained with abundant labels. Notably, SDM‑D outperforms open‑set detection methods such as Grounding SAM and YOLO‑World on all tested fruit detection datasets. Additionally, we introduce MegaFruits, a comprehensive fruit segmentation dataset encompassing over 25,000 images, and all code and datasets are made publicly available at https://github.com/AgRoboticsResearch/SDM‑D.git.

Abstract:
The ambition of brain‑inspired Spiking Neural Networks (SNNs) is to become a low‑power alternative to traditional Artificial Neural Networks (ANNs). This work addresses two major challenges in realizing this vision: the performance gap between SNNs and ANNs, and the high training costs of SNNs. We identify intrinsic flaws in spiking neurons caused by binary firing mechanisms and propose a Spike Firing Approximation (SFA) method using integer training and spike‑driven inference. This optimizes the spike firing pattern of spiking neurons, enhancing efficient training, reducing power consumption, improving performance, enabling easier scaling, and better utilizing neuromorphic chips. We also develop an efficient spike‑driven Transformer architecture and a spike‑masked autoencoder to prevent performance degradation during SNN scaling. On ImageNet‑1k, we achieve state‑of‑the‑art top‑1 accuracy of 78.5%, 79.8%, 84.0%, and 86.2% with models containing 10M, 19M, 83M, and 173M parameters, respectively. For instance, the 10M model outperforms the best existing SNN by 7.2% on ImageNet, with training time acceleration and inference energy efficiency improved by 4.5× and 3.9×, respectively. We validate the effectiveness and efficiency of the proposed method across various tasks, including object detection, semantic segmentation, and neuromorphic vision tasks. This work enables SNNs to match ANN performance while maintaining the low‑power advantage, marking a significant step towards SNNs as a general visual backbone. Code is available at https://github.com/BICLab/Spike‑Driven‑Transformer‑V3.

Abstract:
Deep learning has seen remarkable advancements in machine learning, yet it often demands extensive annotated data. Tasks like 3D semantic segmentation impose a substantial annotation burden, especially in domains like medicine, where expert annotations drive up the cost. Active learning (AL) holds great potential to alleviate this annotation burden in 3D medical segmentation. The majority of existing AL methods, however, are not tailored to the medical domain. While weakly‑supervised methods have been explored to reduce annotation burden, the fusion of AL with weak supervision remains unexplored, despite its potential to significantly reduce annotation costs. Additionally, there is little focus on slice‑based AL for 3D segmentation, which can also significantly reduce costs in comparison to conventional volume‑based AL. This paper introduces a novel metric learning method for Coreset to perform slice‑based active learning in 3D medical segmentation. By merging contrastive learning with inherent data groupings in medical imaging, we learn a metric that emphasizes the relevant differences in samples for training 3D medical segmentation models. We perform comprehensive evaluations using both weak and full annotations across four datasets (medical and non‑medical). Our findings demonstrate that our approach surpasses existing active learning techniques on both weak and full annotations and obtains superior performance with low‑annotation budgets which is crucial in medical imaging. Source code for this project is available in the supplementary materials and on GitHub: https://github.com/arvindmvepa/al‑seg.

Abstract:
Convolutions (Convs) and multi‑head self‑attentions (MHSAs) are typically considered alternatives to each other for building vision backbones. Although some works try to integrate both, they apply the two operators simultaneously at the finest pixel granularity. With Convs responsible for per‑pixel feature extraction already, the question is whether we still need to include the heavy MHSAs at such a fine‑grained level. In fact, this is the root cause of the scalability issue w.r.t. the input resolution for vision transformers. To address this important problem, we propose in this work to use MHSAs and Convs in parallel at different granularity levels instead. Specifically, in each layer, we use two different ways to represent an image: a fine‑grained regular grid and a coarse‑grained set of semantic slots. We apply different operations to these two representations: Convs to the grid for local features, and MHSAs to the slots for global features. A pair of fully differentiable soft clustering and dispatching modules is introduced to bridge the grid and set representations, thus enabling local‑global fusion. Through extensive experiments on various vision tasks, we empirically verify the potential of the proposed integration scheme, named GLMix: by offloading the burden of fine‑grained features to light‑weight Convs, it is sufficient to use MHSAs in a few (e.g., 64) semantic slots to match the performance of recent state‑of‑the‑art backbones, while being more efficient. Our visualization results also demonstrate that the soft clustering module produces a meaningful semantic grouping effect with only IN1k classification supervision, which may induce better interpretability and inspire new weakly‑supervised semantic segmentation approaches. Code will be available at https://github.com/rayleizhu/GLMix.

Abstract:
Contrastive Language‑Image Pre‑training (CLIP) exhibits strong zero‑shot classification ability on various image‑level tasks, motivating research on adapting CLIP to pixel‑level open‑vocabulary semantic segmentation without additional training. The key is to improve the spatial representation of image‑level CLIP, for example by replacing the self‑attention map at the last layer with a self‑self attention map or an attention map from a vision foundation model. In this paper, we present a novel hierarchical framework, named CLIPer, that hierarchically improves the spatial representation of CLIP. The proposed CLIPer includes an early‑layer fusion module and a fine‑grained compensation module. We observe that the embeddings and attention maps at early layers can preserve spatial structural information. Inspired by this, we design the early‑layer fusion module to generate a segmentation map with better spatial coherence. Afterwards, we employ a fine‑grained compensation module to compensate for the local details using the self‑attention maps of a diffusion model. We conduct experiments on seven segmentation datasets. Our proposed CLIPer achieves state‑of‑the‑art performance on these datasets. For instance, using ViT‑L, CLIPer achieves mIoU of 69.8% and 43.3% on VOC and COCO Object, outperforming ProxyCLIP by 9.2% and 4.1%, respectively.

Abstract:
Volume parameterizations abound in recent literature, from the classic voxel grid to the implicit neural representation and everything in between. While implicit representations have shown impressive capacity and better memory efficiency compared to voxel grids, to date they require training via nonconvex optimization. This nonconvex training process can be slow to converge and sensitive to initialization and hyperparameter choices that affect the final converged result. We introduce a family of models, GA‑Planes, that is the first class of implicit neural volume representations that can be trained by convex optimization. GA‑Planes models include any combination of features stored in tensor basis elements, followed by a neural feature decoder. They generalize many existing representations and can be adapted for convex, semiconvex, or nonconvex training as needed for different inverse problems. In the 2D setting, we prove that GA‑Planes is equivalent to a low‑rank plus low‑resolution matrix factorization; we show that this approximation outperforms the classic low‑rank plus sparse decomposition for fitting a natural image. In 3D, we demonstrate GA‑Planes' competitive performance in terms of expressiveness, model size, and optimizability across three volume fitting tasks: radiance field reconstruction, 3D segmentation, and video segmentation.

Abstract:
Existing methodologies in open vocabulary 3D semantic segmentation primarily concentrate on establishing a unified feature space encompassing 3D, 2D, and textual modalities. Nevertheless, traditional techniques such as global feature alignment or vision‑language model distillation tend to impose only approximate correspondence, struggling notably with delineating fine‑grained segmentation boundaries. To address this gap, we propose a more meticulous mask‑level alignment between 3D features and the 2D‑text embedding space through a cross‑modal mask reasoning framework, XMask3D. In our approach, we developed a mask generator based on the denoising UNet from a pre‑trained diffusion model, leveraging its capability for precise textual control over dense pixel representations and enhancing the open‑world adaptability of the generated masks. We further integrate 3D global features as implicit conditions into the pre‑trained 2D denoising UNet, enabling the generation of segmentation masks with additional 3D geometry awareness. Subsequently, the generated 2D masks are employed to align mask‑level 3D representations with the vision‑language feature space, thereby augmenting the open vocabulary capability of 3D geometry embeddings. Finally, we fuse complementary 2D and 3D mask features, resulting in competitive performance across multiple benchmarks for 3D open vocabulary semantic segmentation. Code is available at https://github.com/wangzy22/XMask3D.

Abstract:
Semantic segmentation is a crucial task in medical imaging. Although supervised learning techniques have proven to be effective in performing this task, they heavily depend on large amounts of annotated training data. The recently introduced Segment Anything Model (SAM) enables prompt‑based segmentation and offers zero‑shot generalization to unfamiliar objects. In our work, we leverage SAM's abstract object understanding for medical image segmentation to provide pseudo labels for semi‑supervised learning, thereby mitigating the need for extensive annotated training data. Our approach refines initial segmentations that are derived from a limited amount of annotated data (comprising up to 43 cases) by extracting bounding boxes and seed points as prompts forwarded to SAM. Thus, it enables the generation of dense segmentation masks as pseudo labels for unlabelled data. The results show that training with our pseudo labels yields an improvement in Dice score from 74.29% to 84.17% and from 66.63% to 74.87% for the segmentation of bones of the paediatric wrist and teeth in dental radiographs, respectively. As a result, our method outperforms intensity‑based post‑processing methods, state‑of‑the‑art supervised learning for segmentation (nnU‑Net), and the semi‑supervised mean teacher approach. Our code is available on GitHub.
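
Two operations in the abstract above are easy to make concrete: turning an initial binary segmentation into a bounding‑box prompt, and scoring a mask with the Dice coefficient, 2|A ∩ B| / (|A| + |B|). The sketch below is a minimal pure‑Python illustration, assuming masks are nested lists of 0/1 values; the function names and the (x_min, y_min, x_max, y_max) box layout are assumptions, not the authors' code or SAM's API.

```python
def bounding_box(mask):
    """Return (x_min, y_min, x_max, y_max) of the foreground, or None if empty."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return (min(cols), min(rows), max(cols), max(rows))

def dice(pred, target):
    """Dice coefficient between two binary masks of equal shape."""
    inter = sum(p * t for p_row, t_row in zip(pred, target)
                for p, t in zip(p_row, t_row))
    total = sum(map(sum, pred)) + sum(map(sum, target))
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2 * inter / total

pred = [[0, 1, 1],
        [0, 1, 0],
        [0, 0, 0]]
target = [[0, 1, 0],
          [0, 1, 0],
          [0, 1, 0]]
box = bounding_box(pred)
score = dice(pred, target)
```

Here the two masks share 2 foreground pixels out of 3 each, so the Dice score is 4/6.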

Abstract:
Recent advances in foundational Vision Language Models (VLMs) have reshaped the evaluation paradigm in computer vision tasks. These foundational models, especially CLIP, have accelerated research in open‑vocabulary computer vision tasks, including Open‑Vocabulary Semantic Segmentation (OVSS). Although the initial results are promising, the dense prediction capabilities of VLMs still require further improvement. In this study, we enhance the semantic segmentation performance of CLIP by introducing new modules and modifications: 1) architectural changes in the last layer of ViT and the incorporation of attention maps from the middle layers with the last layer, 2) Image Engineering: applying data augmentations to enrich input image representations, and 3) using Large Language Models (LLMs) to generate definitions and synonyms for each class name to leverage CLIP's open‑vocabulary capabilities. Our training‑free method, ITACLIP, outperforms current state‑of‑the‑art approaches on segmentation benchmarks such as COCO‑Stuff, COCO‑Object, Pascal Context, and Pascal VOC. Our code is available at https://github.com/m‑arda‑aydn/ITACLIP.

Abstract:
The Segment Anything Model 2 (SAM 2) has demonstrated strong performance in object segmentation tasks but faces challenges in visual object tracking, particularly when managing crowded scenes with fast‑moving or self‑occluding objects. Furthermore, the fixed‑window memory approach in the original model does not consider the quality of memories selected to condition the image features for the next frame, leading to error propagation in videos. This paper introduces SAMURAI, an enhanced adaptation of SAM 2 specifically designed for visual object tracking. By incorporating temporal motion cues with the proposed motion‑aware memory selection mechanism, SAMURAI effectively predicts object motion and refines mask selection, achieving robust, accurate tracking without the need for retraining or fine‑tuning. SAMURAI operates in real‑time and demonstrates strong zero‑shot performance across diverse benchmark datasets, showcasing its ability to generalize without fine‑tuning. In evaluations, SAMURAI achieves significant improvements in success rate and precision over existing trackers, with a 7.1% AUC gain on LaSOT_ext and a 3.5% AO gain on GOT‑10k. Moreover, it achieves competitive results compared to fully supervised methods on LaSOT, underscoring its robustness in complex tracking scenarios and its potential for real‑world applications in dynamic environments.

Abstract:
Indoor radar perception has seen rising interest due to affordable costs driven by emerging automotive imaging radar developments and the benefits of reduced privacy concerns and reliability under hazardous conditions (e.g., fire and smoke). However, existing radar perception pipelines fail to account for distinctive characteristics of the multi‑view radar setting. In this paper, we propose Radar dEtection TRansformer (RETR), an extension of the popular DETR architecture, tailored for multi‑view radar perception. RETR inherits the advantages of DETR, eliminating the need for hand‑crafted components for object detection and segmentation in the image plane. More importantly, RETR incorporates carefully designed modifications such as 1) depth‑prioritized feature similarity via a tunable positional encoding (TPE); 2) a tri‑plane loss from both radar and camera coordinates; and 3) a learnable radar‑to‑camera transformation via reparameterization, to account for the unique multi‑view radar setting. Evaluated on two indoor radar perception datasets, our approach outperforms existing state‑of‑the‑art methods by a margin of 15.38+ AP for object detection and 11.91+ IoU for instance segmentation, respectively. Our implementation is available at https://github.com/merlresearch/radar‑detection‑transformer.

Abstract:
Open‑vocabulary semantic segmentation aims to assign semantic labels to each pixel without being constrained by a predefined set of categories. While Contrastive Language‑Image Pre‑training (CLIP) excels in zero‑shot classification, it struggles to align image patches with category embeddings because of its incoherent patch correlations. This study reveals that inter‑class correlations are the main reason for impairing CLIP's segmentation performance. Accordingly, we propose CorrCLIP, which reconstructs the scope and value of patch correlations. Specifically, CorrCLIP leverages the Segment Anything Model (SAM) to define the scope of patch interactions, reducing inter‑class correlations. To mitigate the problem that SAM‑generated masks may contain patches belonging to different classes, CorrCLIP incorporates self‑supervised models to compute coherent similarity values, suppressing the weight of inter‑class correlations. Additionally, we introduce two additional branches to strengthen patch features' spatial details and semantic representation. Finally, we update segmentation maps with SAM‑generated masks to improve spatial consistency. Based on the improvement across patch correlations, feature representations, and segmentation maps, CorrCLIP achieves superior performance across eight benchmarks. Codes are available at: https://github.com/zdk258/CorrCLIP.

Abstract:
Many state‑of‑the‑art computer vision architectures leverage U‑Net for its adaptability and efficient feature extraction. However, the multi‑resolution convolutional design often leads to significant computational demands, limiting deployment on edge devices. We present a streamlined alternative: a 1D convolutional encoder that retains accuracy while enhancing its suitability for edge applications. Our novel encoder architecture achieves semantic segmentation through channel‑wise 1D convolutions combined with pixel‑unshuffle operations. By incorporating PixelShuffle, known for improving accuracy in super‑resolution tasks while reducing computational load, OneNet captures spatial relationships without requiring 2D convolutions, reducing parameters by up to 47%. Additionally, we explore a fully 1D encoder‑decoder that achieves a 71% reduction in size, albeit with some accuracy loss. We benchmark our approach against U‑Net variants across diverse mask‑generation tasks, demonstrating that it preserves accuracy effectively. Although focused on image segmentation, this architecture is adaptable to other convolutional applications. Code for the project is available at https://github.com/shbyun080/OneNet .
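
The pixel‑unshuffle idea mentioned above can be illustrated with a minimal sketch. This is not the OneNet implementation: the function names, the nested‑list tensor layout, and the all‑ones kernel are assumptions chosen for illustration, and the channel ordering follows the common convention in which output channel c*r*r + i*r + j holds the pixel at offset (i, j) of each r×r block. The point is that once spatial neighbours are packed into channels, a 1D convolution along the channel axis can mix them without any 2D kernel.

```python
def pixel_unshuffle(x, r):
    """Rearrange [C][H][W] nested lists into [C*r*r][H//r][W//r]."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    assert H % r == 0 and W % r == 0, "H and W must be divisible by r"
    out = []
    for c in range(C):
        for i in range(r):
            for j in range(r):
                # Plane (i, j) collects one pixel from every r x r block.
                out.append([[x[c][h * r + i][w * r + j] for w in range(W // r)]
                            for h in range(H // r)])
    return out


def pixel_shuffle(x, r):
    """Inverse rearrangement: [C*r*r][H][W] -> [C][H*r][W*r]."""
    Cr, H, W = len(x), len(x[0]), len(x[0][0])
    assert Cr % (r * r) == 0, "channel count must be divisible by r*r"
    C = Cr // (r * r)
    out = [[[0] * (W * r) for _ in range(H * r)] for _ in range(C)]
    for c in range(C):
        for i in range(r):
            for j in range(r):
                plane = x[c * r * r + i * r + j]
                for h in range(H):
                    for w in range(W):
                        out[c][h * r + i][w * r + j] = plane[h][w]
    return out


def conv1d_over_channels(x, kernel):
    """'Valid' 1D cross-correlation along the channel axis at every pixel."""
    K = len(kernel)
    C, H, W = len(x), len(x[0]), len(x[0][0])
    return [[[sum(kernel[k] * x[c + k][h][w] for k in range(K)) for w in range(W)]
             for h in range(H)]
            for c in range(C - K + 1)]


# A 1 x 4 x 4 image holding the values 0..15 in row-major order.
image = [[[row * 4 + col for col in range(4)] for row in range(4)]]
packed = pixel_unshuffle(image, 2)                   # 4 x 2 x 2
restored = pixel_shuffle(packed, 2)                  # back to 1 x 4 x 4
mixed = conv1d_over_channels(packed, [1, 1, 1, 1])   # sums each 2 x 2 block
```

Here `mixed` holds the sum of every 2×2 block of the input, computed purely with a 1D kernel over channels, and `restored` equals the original image, showing that the rearrangement itself loses no information.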

Abstract:
While Contrastive Language‑Image Pre‑training (CLIP) has advanced open‑vocabulary predictions, its performance on semantic segmentation remains suboptimal. This shortfall primarily stems from its spatial‑invariant semantic features and constrained resolution. While previous adaptations addressed the spatial invariance of semantic features by modifying the self‑attention in CLIP's image encoder, the issue of limited resolution remains unexplored. Different from previous segment‑then‑splice methods that segment sub‑images via a sliding window and splice the results, we introduce a splice‑then‑segment paradigm that incorporates the Segment Anything Model (SAM) to tackle the resolution issue, since SAM excels at extracting fine‑grained semantic correlations from high‑resolution images. Specifically, we introduce Trident, a training‑free framework that first splices features extracted by CLIP and DINO from sub‑images, then leverages SAM's encoder to create a correlation matrix for global aggregation, enabling a broadened receptive field for effective segmentation. Besides, we propose a refinement strategy for CLIP's coarse segmentation outputs by transforming them into prompts for SAM, further enhancing the segmentation performance. Trident achieves a significant improvement in the mIoU across eight benchmarks compared with the current SOTA, increasing from 44.4 to 48.6. Code is available at https://github.com/YuHengsss/Trident.

Abstract:
Vision Transformers (ViT) have recently brought a new wave of research in the field of computer vision. These models have performed particularly well in image classification and segmentation. Research on semantic and instance segmentation has accelerated with the introduction of the new architecture, with over 80% of the top 20 benchmarks for the iSAID dataset based on either the ViT architecture or the attention mechanism behind its success. This paper focuses on the heuristic comparison of three key factors of using (or not using) ViT for semantic segmentation of remote sensing aerial images on the iSAID dataset. The experimental results observed during this research were analyzed based on three objectives. First, we studied the use of a weighted fused loss function to maximize the mean Intersection over Union (mIoU) score and Dice score while minimizing entropy or class representation loss. Second, we compared transfer learning on Meta's MaskFormer, a ViT‑based semantic segmentation model, against a generic UNet Convolutional Neural Network (CNN) based on mIoU, Dice scores, training efficiency, and inference time. Third, we examined the trade‑offs between the two models in comparison to current state‑of‑the‑art segmentation models. We show that the novel combined weighted loss function significantly boosts the CNN model's performance compared to transfer learning with ViT. The code for this implementation can be found at: https://github.com/ashimdahal/ViT‑vs‑CNN‑Image‑Segmentation.
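
To make the idea of a weighted fused loss concrete, here is a minimal sketch for the binary case. It is not the loss used in the paper: the particular combination (cross‑entropy plus soft Dice and soft IoU terms), the weight names `w_ce`, `w_dice`, `w_iou`, and the smoothing constant `eps` are all assumptions made for illustration.

```python
import math


def fused_loss(probs, targets, w_ce=1.0, w_dice=1.0, w_iou=1.0, eps=1e-7):
    """Weighted sum of binary cross-entropy, (1 - soft Dice) and (1 - soft IoU).

    probs:   predicted foreground probabilities, one per pixel
    targets: ground-truth labels (0 or 1), one per pixel
    """
    # Clamp probabilities so the logarithms stay finite.
    clamped = [min(max(p, eps), 1.0 - eps) for p in probs]
    n = len(clamped)

    ce = -sum(t * math.log(p) + (1 - t) * math.log(1.0 - p)
              for p, t in zip(clamped, targets)) / n

    intersection = sum(p * t for p, t in zip(probs, targets))
    total = sum(probs) + sum(targets)
    dice = (2.0 * intersection + eps) / (total + eps)
    iou = (intersection + eps) / (total - intersection + eps)

    return w_ce * ce + w_dice * (1.0 - dice) + w_iou * (1.0 - iou)


targets = [1, 0, 1, 0]
loss_good = fused_loss([0.9, 0.1, 0.9, 0.1], targets)
loss_bad = fused_loss([0.1, 0.9, 0.1, 0.9], targets)
loss_perfect = fused_loss([1.0, 0.0, 1.0, 0.0], targets)
dice_only = fused_loss([0.9, 0.1, 0.9, 0.1], targets, w_ce=0.0, w_iou=0.0)
```

Lowering the loss raises the overlap terms and reduces the entropy term together; the weights control which objective dominates, and in a multi‑class setting the same terms would typically be averaged over classes.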

Abstract:
In view of the fact that semi‑ and self‑supervised learning share a fundamental principle, effectively modeling knowledge from unlabeled data, various semi‑supervised semantic segmentation methods have integrated representative self‑supervised learning paradigms for further regularization. However, the potential of the state‑of‑the‑art generative self‑supervised paradigm, masked image modeling, has been scarcely studied. This paradigm learns knowledge by establishing connections between the masked and visible parts of the masked image during the pixel reconstruction process. By inheriting and extending this insight, we successfully leverage masked image modeling to boost semi‑supervised semantic segmentation. Specifically, we introduce a novel class‑wise masked image modeling that independently reconstructs different image regions according to their respective classes. In this way, the mask‑induced connections are established within each class, mitigating the semantic confusion that arises from plainly reconstructing images in basic masked image modeling. To strengthen these intra‑class connections, we further develop a feature aggregation strategy that minimizes the distances between features corresponding to the masked and visible parts within the same class. Additionally, in semantic space, we explore the application of masked image modeling to enhance regularization. Extensive experiments conducted on well‑known benchmarks demonstrate that our approach achieves state‑of‑the‑art performance. The code will be available at https://github.com/haoxt/S4MIM.

Abstract:
Accurate object detection and prediction are critical to ensure the safety and efficiency of self‑driving architectures. Predicting object trajectories and occupancy enables autonomous vehicles to anticipate movements and make decisions with future information, increasing their adaptability and reducing the risk of accidents. Current State‑Of‑The‑Art (SOTA) approaches often isolate the detection, tracking, and prediction stages, which can lead to significant prediction errors due to accumulated inaccuracies between stages. Recent advances have improved the feature representation of multi‑camera perception systems through Bird's‑Eye View (BEV) transformations, boosting the development of end‑to‑end systems capable of predicting environmental elements directly from vehicle sensor data. These systems, however, often suffer from high processing times and number of parameters, creating challenges for real‑world deployment. To address these issues, this paper introduces a novel BEV instance prediction architecture based on a simplified paradigm that relies only on instance segmentation and flow prediction. The proposed system prioritizes speed, aiming at reduced parameter counts and inference times compared to existing SOTA architectures, thanks to the incorporation of an efficient transformer‑based architecture. Furthermore, the implementation of the proposed architecture is optimized for performance improvements in PyTorch version 2.1. Code and trained models are available at https://github.com/miguelag99/Efficient‑Instance‑Prediction

Abstract:
Unsupervised Domain Adaptation for Remote Sensing Semantic Segmentation (UDA‑RSSeg) addresses the challenge of adapting a model trained on source domain data to target domain samples, thereby minimizing the need for annotated data across diverse remote sensing scenes. This task presents two principal challenges: (1) severe inconsistencies in feature representation across different remote sensing domains, and (2) a domain gap that emerges due to the representation bias of source domain patterns when translating features to predictive logits. To tackle these issues, we propose a joint‑optimized adversarial network incorporating the Segment Anything Model (SAM), termed SAM‑JOANet, for UDA‑RSSeg. Our approach integrates SAM to leverage its robust generalized representation capabilities, thereby alleviating feature inconsistencies. We introduce a finetuning decoder designed to convert SAM‑Encoder features into predictive logits. Additionally, a feature‑level adversarial‑based prompted segmentor is employed to generate class‑agnostic maps, which guide the finetuning decoder's feature representations. The network is optimized end‑to‑end, combining the prompted segmentor and the finetuning decoder. Extensive evaluations on benchmark datasets, including ISPRS (Potsdam/Vaihingen) and CITY‑OSM (Paris/Chicago), demonstrate the effectiveness of our method. The results, supported by visualization and analysis, confirm the method's interpretability and robustness. The code of this paper is available at https://github.com/CV‑ShuchangLyu/SAM‑JOANet.

Abstract:
In semi‑supervised semantic segmentation (SSS), weak‑to‑strong consistency regularization techniques are widely utilized in recent works, typically combined with input‑level and feature‑level perturbations. However, the integration between weak‑to‑strong consistency regularization and network perturbation has been relatively rare. We note several problems with existing network perturbations in SSS that may contribute to this phenomenon. By revisiting network perturbations, we introduce a new approach for network perturbation to expand the existing weak‑to‑strong consistency regularization for unlabeled data. Additionally, we present a volatile learning process for labeled data, which is uncommon in existing research. Building upon previous work that includes input‑level and feature‑level perturbations, we present MLPMatch (Multi‑Level‑Perturbation Match), an easy‑to‑implement and efficient framework for semi‑supervised semantic segmentation. MLPMatch has been validated on the Pascal VOC and Cityscapes datasets, achieving state‑of‑the‑art performance. Code is available from https://github.com/LlistenL/MLPMatch.

Abstract:
Multi‑modal image fusion (MMIF) enhances the information content of the fused image by combining the unique as well as common features obtained from different modality sensor images, improving visualization, object detection, and many more tasks. In this work, we introduce an interpretable network for the MMIF task, named FNet, based on an $\ell_0$‑regularized multi‑modal convolutional sparse coding (MCSC) model. Specifically, for solving the $\ell_0$‑regularized CSC problem, we design a learnable $\ell_0$‑regularized sparse coding (LZSC) block in a principled manner through deep unfolding. Given different modality source images, FNet first separates the unique and common features from them using the LZSC block and then these features are combined to generate the final fused image. Additionally, we propose an $\ell_0$‑regularized MCSC model for the inverse fusion process. Based on this model, we introduce an interpretable inverse fusion network named IFNet, which is utilized during FNet's training. Extensive experiments show that FNet achieves high‑quality fusion results across eight different MMIF datasets. Furthermore, we show that FNet enhances downstream object detection and semantic segmentation in visible‑thermal image pairs. We have also visualized the intermediate results of FNet, which demonstrates the good interpretability of our network. Link for code and models: https://github.com/gargi884/FNet‑MMIF.

Abstract:
In open‑world scenarios, where both novel classes and domains may exist, an ideal segmentation model should detect anomaly classes for safety and generalize to new domains. However, existing methods often struggle to distinguish between domain‑level and semantic‑level distribution shifts, leading to poor out‑of‑distribution (OOD) detection or domain generalization performance. In this work, we aim to equip the model to generalize effectively to covariate‑shift regions while precisely identifying semantic‑shift regions. To achieve this, we design a novel generative augmentation method to produce coherent images that incorporate both anomaly (or novel) objects and various covariate shifts at both image and object levels. Furthermore, we introduce a training strategy that recalibrates uncertainty specifically for semantic shifts and enhances the feature extractor to align features associated with domain shifts. We validate the effectiveness of our method across benchmarks featuring both semantic and domain shifts. Our method achieves state‑of‑the‑art performance across all benchmarks for both OOD detection and domain generalization. Code is available at https://github.com/gaozhitong/MultiShiftSeg.

Abstract:
State‑of‑the‑art methods for Transformer‑based semantic segmentation typically adopt Transformer decoders that are used to extract additional embeddings from image embeddings via cross‑attention, refine either or both types of embeddings via self‑attention, and project image embeddings onto the additional embeddings via dot‑product. Despite their remarkable success, these empirical designs still lack theoretical justifications or interpretations, thus hindering potentially principled improvements. In this paper, we argue that there are fundamental connections between semantic segmentation and compression, especially between the Transformer decoders and Principal Component Analysis (PCA). From such a perspective, we derive a white‑box, fully attentional DEcoder for PrIncipled semantiC segmenTation (DEPICT), with the interpretations as follows: 1) the self‑attention operator refines image embeddings to construct an ideal principal subspace that aligns with the supervision and retains most information; 2) the cross‑attention operator seeks to find a low‑rank approximation of the refined image embeddings, which is expected to be a set of orthonormal bases of the principal subspace and corresponds to the predefined classes; 3) the dot‑product operation yields compact representation for image embeddings as segmentation masks. Experiments on the ADE20K dataset show that DEPICT consistently outperforms its black‑box counterpart, Segmenter, while being lightweight and more robust.

Abstract:
Semi‑supervised video object segmentation (VOS) has been largely driven by space‑time memory (STM) networks, which store past frame features in a spatiotemporal memory to segment the current frame via softmax attention. However, STM networks face memory limitations due to the quadratic complexity of softmax matching, restricting their applicability as video length and resolution increase. To address this, we propose LiVOS, a lightweight memory network that employs linear matching via linear attention, reformulating memory matching into a recurrent process that reduces the quadratic attention matrix to a constant‑size, spatiotemporal‑agnostic 2D state. To enhance selectivity, we introduce gated linear matching, where a data‑dependent gate matrix is multiplied with the state matrix to control what information to retain or discard. Experiments on diverse benchmarks demonstrated the effectiveness of our method. It achieved 64.8 J&F on MOSE and 85.1 J&F on DAVIS, surpassing all non‑STM methods and narrowing the gap with STM‑based approaches. For longer and higher‑resolution videos, it matched STM‑based methods with 53% less GPU memory and supports 4096p inference on a 32G consumer‑grade GPU, a previously cost‑prohibitive capability, opening the door for long and high‑resolution video foundation models.
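
The recurrent reformulation described above can be sketched in a few lines. This is a simplified illustration, not the LiVOS implementation: it uses one key/value token per memory frame, omits the feature map and normalization that linear attention usually includes, and treats the gate as an elementwise multiplier on the state, which is only one reading of "multiplied with the state matrix". All function names are hypothetical.

```python
def outer(k, v):
    """Outer product k^T v as a len(k) x len(v) matrix."""
    return [[ki * vj for vj in v] for ki in k]


def update_state(state, k, v, gate=None):
    """One recurrent write: state <- gate * state + k^T v (elementwise gate)."""
    kv = outer(k, v)
    if gate is None:
        gate = [[1.0] * len(v) for _ in k]   # no forgetting
    return [[gate[a][b] * state[a][b] + kv[a][b] for b in range(len(v))]
            for a in range(len(k))]


def read_state(state, q):
    """Readout q @ state; cost does not depend on how many frames were written."""
    return [sum(q[a] * state[a][b] for a in range(len(q)))
            for b in range(len(state[0]))]


def quadratic_reference(q, keys, values):
    """Unnormalized attention over all stored frames: sum_i (q . k_i) v_i."""
    out = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        weight = sum(qa * ka for qa, ka in zip(q, k))
        out = [o + weight * vb for o, vb in zip(out, v)]
    return out


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
query = [0.5, 2.0]

# Ungated recurrence: write every memory frame into a fixed 2 x 3 state.
state = [[0.0] * 3 for _ in range(2)]
for k, v in zip(keys, values):
    state = update_state(state, k, v)
recurrent_out = read_state(state, query)
reference_out = quadratic_reference(query, keys, values)

# A zero gate on the last write discards everything stored before it.
forget_all = [[0.0] * 3 for _ in range(2)]
gated_state = [[0.0] * 3 for _ in range(2)]
for idx, (k, v) in enumerate(zip(keys, values)):
    gate = forget_all if idx == len(keys) - 1 else None
    gated_state = update_state(gated_state, k, v, gate)
gated_out = read_state(gated_state, query)
```

Without a gate, the recurrent readout matches the sum over all stored frames while the state stays 2×3 regardless of how many frames were written, which is where the constant memory footprint comes from. The gate then decides how much of the earlier state survives each write.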

Abstract:
In this paper, an exhaustive review and comprehensive analysis of recent and former deep learning methods in 3D Semantic Segmentation (3DSS) is presented. In the related literature, the taxonomy scheme used for the classification of the 3DSS deep learning methods is ambiguous. Based on the taxonomy schemes of 9 existing review papers, a new taxonomy scheme of the 3DSS deep learning methods is proposed, aiming to standardize it and improve the comparability and clarity across related studies. Furthermore, an extensive overview of the available 3DSS indoor and outdoor datasets is provided along with their links. The core part of the review is the detailed presentation of recent and former 3DSS deep learning methods and their classification using the proposed taxonomy scheme along with their GitHub repositories. Additionally, a brief but informative analysis of the evaluation metrics and loss functions used in 3DSS is included. Finally, a discussion of the examined 3DSS methods and datasets is presented to foster new research directions and applications in the field of 3DSS. As a supplement to this review, a GitHub repository is provided (https://github.com/thobet/Deep‑Learning‑on‑3D‑Semantic‑Segmentation‑a‑Detailed‑Review), including a quick classification of over 400 3DSS methods using the proposed taxonomy scheme.

Abstract:
Humans excel at detecting and segmenting moving objects according to the Gestalt principle of "common fate". Remarkably, previous works have shown that human perception generalizes this principle in a zero‑shot fashion to unseen textures or random dots. In this work, we seek to better understand the computational basis for this capability by evaluating a broad range of optical flow models and a neuroscience inspired motion energy model for zero‑shot figure‑ground segmentation of random dot stimuli. Specifically, we use the extensively validated motion energy model proposed by Simoncelli and Heeger in 1998 which is fitted to neural recordings in cortex area MT. We find that a cross section of 40 deep optical flow models trained on different datasets struggle to estimate motion patterns in random dot videos, resulting in poor figure‑ground segmentation performance. Conversely, the neuroscience‑inspired model significantly outperforms all optical flow models on this task. For a direct comparison to human perception, we conduct a psychophysical study using a shape identification task as a proxy to measure human segmentation performance. All state‑of‑the‑art optical flow models fall short of human performance, but only the motion energy model matches human capability. This neuroscience‑inspired model successfully addresses the lack of human‑like zero‑shot generalization to random dot stimuli in current computer vision models, and thus establishes a compelling link between the Gestalt psychology of human object perception and cortical motion processing in the brain. Code, models and datasets are available at https://github.com/mtangemann/motion_energy_segmentation

Abstract:
Recent advancements in generative AI, particularly diffusion‑based image editing, have enabled the transformation of images into highly realistic scenes using only text instructions. This technology offers significant potential for generating diverse synthetic datasets to evaluate model robustness. In this paper, we introduce Cityscape‑Adverse, a benchmark that employs diffusion‑based image editing to simulate eight adverse conditions, including variations in weather, lighting, and seasons, while preserving the original semantic labels. We evaluate the reliability of diffusion‑based models in generating realistic scene modifications and assess the performance of state‑of‑the‑art CNN and Transformer‑based semantic segmentation models under these challenging conditions. Additionally, we analyze which modifications have the greatest impact on model performance and explore how training on synthetic datasets can improve robustness in real‑world adverse scenarios. Our results demonstrate that all tested models, particularly CNN‑based architectures, experienced significant performance degradation under extreme conditions, while Transformer‑based models exhibited greater resilience. We verify that models trained on Cityscape‑Adverse show significantly enhanced resilience when applied to unseen domains. Code and datasets will be released at https://github.com/naufalso/cityscape‑adverse.

Abstract:
Federated Learning (FL) is a form of distributed learning that allows multiple institutions or clients to collaboratively learn a global model to solve a task. This allows the model to utilize the information from every institute while preserving data privacy. However, recent studies show that the promise of protecting the privacy of data is not upheld by existing methods and that it is possible to recreate the training data from the different institutions. This is done by utilizing gradients transferred between the clients and the global server during training or by knowing the model architecture at the client end. In this paper, we propose a federated learning framework for semantic segmentation without knowing the model architecture or transferring gradients between the client and the server, thus enabling better privacy preservation. We propose BlackFed, a black‑box adaptation of neural networks that utilizes zero order optimization (ZOO) to update the client model weights and first order optimization (FOO) to update the server weights. We evaluate our approach on several computer vision and medical imaging datasets to demonstrate its effectiveness. To the best of our knowledge, this work is one of the first to employ federated learning for segmentation without exchanging gradients or model information. Code: https://github.com/JayParanjape/blackfed/tree/master

Abstract:
Existing multi‑modal image fusion methods fail to address the compound degradations presented in source images, resulting in fusion images plagued by noise, color bias, improper exposure, etc. Additionally, these methods often overlook the specificity of foreground objects, weakening the salience of the objects of interest within the fused images. To address these challenges, this study proposes a novel interactive multi‑modal image fusion framework based on the text‑modulated diffusion model, called Text‑DiFuse. First, this framework integrates feature‑level information integration into the diffusion process, allowing adaptive degradation removal and multi‑modal information fusion. This is the first attempt to deeply and explicitly embed information fusion within the diffusion process, effectively addressing compound degradation in image fusion. Second, by embedding the combination of the text and zero‑shot location model into the diffusion fusion process, a text‑controlled fusion re‑modulation strategy is developed. This enables user‑customized text control to improve fusion performance and highlight foreground objects in the fused images. Extensive experiments on diverse public datasets show that our Text‑DiFuse achieves state‑of‑the‑art fusion performance across various scenarios with complex degradation. Moreover, the semantic segmentation experiment validates the significant enhancement in semantic performance achieved by our text‑controlled fusion re‑modulation strategy. The code is publicly available at https://github.com/Leiii‑Cao/Text‑DiFuse.

Abstract:
The field of Remote Sensing Domain Generalization (RSDG) has emerged as a critical and valuable research frontier, focusing on developing models that generalize effectively across diverse scenarios. Despite the substantial domain gaps in RS images that are characterized by variabilities such as location, wavelength, and sensor type, this area remains underexplored: (1) Current cross‑domain methods primarily focus on Domain Adaptation (DA), which adapts models to predefined domains rather than to unseen ones; (2) Few studies target the RSDG issue, especially for semantic segmentation tasks, where existing models are developed for specific unknown domains and struggle with underfitting on other unknown scenarios; (3) Existing RS foundation models tend to prioritize in‑domain performance over cross‑domain generalization. To this end, we introduce the first vision foundation model for RSDG semantic segmentation, CrossEarth. CrossEarth demonstrates strong cross‑domain generalization through a specially designed data‑level Earth‑Style Injection pipeline and a model‑level Multi‑Task Training pipeline. In addition, for the semantic segmentation task, we have curated an RSDG benchmark comprising 32 cross‑domain settings across various regions, spectral bands, platforms, and climates, providing a comprehensive framework for testing the generalizability of future RSDG models. Extensive experiments on this benchmark demonstrate the superiority of CrossEarth over existing state‑of‑the‑art methods.

Abstract:
Few‑shot 3D point cloud segmentation (FS‑PCS) aims at generalizing models to segment novel categories with minimal annotated support samples. While existing FS‑PCS methods have shown promise, they primarily focus on unimodal point cloud inputs, overlooking the potential benefits of leveraging multimodal information. In this paper, we address this gap by introducing a multimodal FS‑PCS setup, utilizing textual labels and the potentially available 2D image modality. Under this easy‑to‑achieve setup, we present the MultiModal Few‑Shot SegNet (MM‑FSS), a model effectively harnessing complementary information from multiple modalities. MM‑FSS employs a shared backbone with two heads to extract intermodal and unimodal visual features, and a pretrained text encoder to generate text embeddings. To fully exploit the multimodal information, we propose a Multimodal Correlation Fusion (MCF) module to generate multimodal correlations, and a Multimodal Semantic Fusion (MSF) module to refine the correlations using text‑aware semantic guidance. Additionally, we propose a simple yet effective Test‑time Adaptive Cross‑modal Calibration (TACC) technique to mitigate training bias, further improving generalization. Experimental results on S3DIS and ScanNet datasets demonstrate significant performance improvements achieved by our method. The efficacy of our approach indicates the benefits of leveraging commonly‑ignored free modalities for FS‑PCS, providing valuable insights for future research. The code is available at https://github.com/ZhaochongAn/Multimodality‑3D‑Few‑Shot

Abstract:
Despite their success, unsupervised domain adaptation methods for semantic segmentation primarily focus on adaptation between image domains and do not utilize other abundant visual modalities like depth, infrared and event. This limitation hinders their performance and restricts their application in real‑world multimodal scenarios. To address this issue, we propose Modality Adaptation with text‑to‑image Diffusion Models (MADM) for semantic segmentation task which utilizes text‑to‑image diffusion models pre‑trained on extensive image‑text pairs to enhance the model's cross‑modality capabilities. Specifically, MADM comprises two key complementary components to tackle major challenges. First, due to the large modality gap, using one modal data to generate pseudo labels for another modality suffers from a significant drop in accuracy. To address this, MADM designs diffusion‑based pseudo‑label generation which adds latent noise to stabilize pseudo‑labels and enhance label accuracy. Second, to overcome the limitations of latent low‑resolution features in diffusion models, MADM introduces the label palette and latent regression which converts one‑hot encoded labels into the RGB form by palette and regresses them in the latent space, thus ensuring the pre‑trained decoder for up‑sampling to obtain fine‑grained features. Extensive experimental results demonstrate that MADM achieves state‑of‑the‑art adaptation performance across various modality tasks, including images to depth, infrared, and event modalities. We open‑source our code and models at https://github.com/XiaRho/MADM.

Abstract:
High‑speed video (HSV) segmentation is essential for analyzing dynamic physical processes in scientific and industrial applications, such as boiling heat transfer. Existing models like U‑Net struggle with generalization and accurately segmenting complex bubble formations. We present VideoSAM, a specialized adaptation of the Segment Anything Model (SAM), fine‑tuned on a diverse HSV dataset for phase detection. Through diverse experiments, VideoSAM demonstrates superior performance across four fluid environments (Water, FC‑72, Nitrogen, and Argon), significantly outperforming U‑Net in complex segmentation tasks. In addition to introducing VideoSAM, we contribute an open‑source HSV segmentation dataset designed for phase detection, enabling future research in this domain. Our findings underscore VideoSAM's potential to set new standards in robust and accurate HSV segmentation. The code and dataset used in this study are available online at https://github.com/chikap421/videosam.

Abstract:
In the evolving landscape of deep learning, there is a pressing need for more comprehensive datasets capable of training models across multiple modalities. Concurrently, in digital humanities, there is a growing demand to leverage technology for diverse media adaptation and creation, yet limited by sparse datasets due to copyright and stylistic constraints. Addressing this gap, our paper presents a novel dataset comprising Franco‑Belgian comics from the 1950s annotated for tasks including depth estimation, semantic segmentation, saliency detection, and character identification. It consists of two distinct and consistent styles and incorporates object concepts and labels taken from natural images. By including such diverse information across styles, this dataset not only holds promise for computational creativity but also offers avenues for the digitization of art and storytelling innovation. This dataset is a crucial component of the AI4VA Workshop Challenges (https://sites.google.com/view/ai4vaeccv2024), where we specifically explore depth and saliency. Dataset details at https://github.com/IVRL/AI4VA.

Abstract:
In cross‑modal unsupervised domain adaptation, a model trained on source‑domain data (e.g., synthetic) is adapted to target‑domain data (e.g., real‑world) without access to target annotation. Previous methods seek to mutually mimic cross‑modal outputs in each domain, which enforces a class probability distribution that is agreeable in different domains. However, they overlook the complementarity brought by the heterogeneous fusion in cross‑modal learning. In light of this, we propose a novel fusion‑then‑distillation (FtD++) method to explore cross‑modal positive distillation of the source and target domains for 3D semantic segmentation. FtD++ realizes distribution consistency between outputs not only for 2D images and 3D point clouds but also for source‑domain and augment‑domain. Specifically, our method contains three key ingredients. First, we present a model‑agnostic feature fusion module to generate the cross‑modal fusion representation for establishing a latent space. In this space, the two modalities are encouraged to achieve maximum correlation and complementarity. Second, the proposed cross‑modal positive distillation preserves the complete information of multi‑modal input and combines the semantic content of the source domain with the style of the target domain, thereby achieving domain‑modality alignment. Finally, cross‑modal debiased pseudo‑labeling is devised to model the uncertainty of pseudo‑labels in a self‑training manner. Extensive experiments report state‑of‑the‑art results on several domain adaptive scenarios under unsupervised and semi‑supervised settings. Code is available at https://github.com/Barcaaaa/FtD‑PlusPlus.

Abstract:
Autonomous agents require the capability to identify dynamic objects in their environment for safe planning and navigation. Incomplete and erroneous dynamic detections jeopardize the agent's ability to accomplish its task. Dynamic detection is a challenging problem due to the numerous sources of uncertainty inherent in the problem's inputs and the wide variety of applications, which often lead to use‑case‑tailored solutions. We propose a robust learning‑free approach to segment moving objects in point cloud data. The foundation of the approach lies in modelling each voxel using a hidden Markov model (HMM), and probabilistically integrating beliefs into a map using an HMM filter. The proposed approach is tested on benchmark datasets and consistently performs better than or as well as state‑of‑the‑art methods with strong generalized performance across sensor characteristics and environments. The approach is open‑sourced at https://github.com/vb44/HMM‑MOS.
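
The per‑voxel filtering idea in the abstract above can be illustrated with a minimal sketch of a discrete HMM forward filter. The two states (free, occupied), the transition matrix, and the observation likelihoods below are hypothetical values chosen for illustration; they are not taken from the paper or its code.

```python
def hmm_filter_step(belief, transition, likelihood):
    """One predict-update step of a discrete HMM forward filter."""
    n = len(belief)
    # Predict: push the current belief through the transition model.
    predicted = [sum(belief[i] * transition[i][j] for i in range(n))
                 for j in range(n)]
    # Update: weight each state by how well it explains the observation.
    unnormalized = [predicted[j] * likelihood[j] for j in range(n)]
    total = sum(unnormalized)
    if total == 0.0:
        # Observation impossible under the model: keep the prediction.
        return predicted
    return [u / total for u in unnormalized]


# Hypothetical two-state voxel model: index 0 = free, index 1 = occupied.
TRANSITION = [[0.95, 0.05],
              [0.05, 0.95]]
# Hypothetical sensor model: P(observation | state).
LIKELIHOOD = {"hit": [0.1, 0.9], "miss": [0.9, 0.1]}


def filter_voxel(observations, prior=(0.5, 0.5)):
    """Fold a sequence of 'hit'/'miss' observations into one voxel belief."""
    belief = list(prior)
    for obs in observations:
        belief = hmm_filter_step(belief, TRANSITION, LIKELIHOOD[obs])
    return belief
```

In this toy setup, a voxel whose belief moves from occupied back to free over a few scans would be a candidate for a dynamic object; how the actual method makes that decision is not specified in the abstract.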

Abstract:
Simulators are indispensable for research in autonomous systems such as self‑driving cars, autonomous robots, and drones. Despite significant progress in various simulation aspects, such as graphical realism, an evident gap persists between the virtual and real‑world environments. Since the ultimate goal is to deploy the autonomous systems in the real world, reducing the sim2real gap is of utmost importance. In this paper, we employ a state‑of‑the‑art approach to enhance the photorealism of simulated data, aligning them with the visual characteristics of real‑world datasets. Based on this, we developed CARLA2Real, an easy‑to‑use, publicly available tool (plug‑in) for the widely used and open‑source CARLA simulator. This tool enhances the output of CARLA in near real‑time, achieving a frame rate of 13 FPS, translating it to the visual style and realism of real‑world datasets such as Cityscapes, KITTI, and Mapillary Vistas. By employing the proposed tool, we generated synthetic datasets from both the simulator and the enhancement model outputs, including their corresponding ground truth annotations for tasks related to autonomous driving. Then, we performed a number of experiments to evaluate the impact of the proposed approach on feature extraction and semantic segmentation methods when trained on the enhanced synthetic data. The results demonstrate that the sim2real appearance gap is significant and can indeed be reduced by the introduced approach. Comparisons with a state‑of‑the‑art image‑to‑image translation approach are also provided. The tool, pre‑trained models, and associated data for this work are available for download at: https://github.com/stefanos50/CARLA2Real.

Abstract:
Vision‑language models (VLMs) have excelled in multimodal tasks, but adapting them to embodied decision‑making in open‑world environments presents challenges. One critical issue is bridging the gap between discrete entities in low‑level observations and the abstract concepts required for effective planning. A common solution is building hierarchical agents, where VLMs serve as high‑level reasoners that break down tasks into executable sub‑tasks, typically specified using language. However, language suffers from the inability to communicate detailed spatial information. We propose visual‑temporal context prompting, a novel communication protocol between VLMs and policy models. This protocol leverages object segmentation from past observations to guide policy‑environment interactions. Using this approach, we train ROCKET‑1, a low‑level policy that predicts actions based on concatenated visual observations and segmentation masks, supported by real‑time object tracking from SAM‑2. Our method unlocks the potential of VLMs, enabling them to tackle complex tasks that demand spatial reasoning. Experiments in Minecraft show that our approach enables agents to achieve previously unattainable tasks, with a 76% absolute improvement in open‑world interaction performance. Codes and demos are now available on the project page: https://craftjarvis.github.io/ROCKET‑1.

Abstract:
Camouflaged Object Segmentation (COS) faces significant challenges due to the scarcity of annotated data, where meticulous pixel‑level annotation is both labor‑intensive and costly, primarily due to the intricate object‑background boundaries. To the core question, "Can COS be effectively achieved in a zero‑shot manner without manual annotations for any camouflaged object?", we respond affirmatively and introduce a robust zero‑shot COS framework. This framework leverages the inherent local pattern bias of COS and employs a broad semantic feature space derived from salient object segmentation (SOS) for efficient zero‑shot transfer. We incorporate a Masked Image Modeling (MIM) based image encoder optimized for Parameter‑Efficient Fine‑Tuning (PEFT), a Multimodal Large Language Model (M‑LLM), and a Multi‑scale Fine‑grained Alignment (MFA) mechanism. The MIM pre‑trained image encoder focuses on capturing essential low‑level features, while the M‑LLM generates caption embeddings processed alongside these visual cues. These embeddings are precisely aligned using MFA, enabling our framework to accurately interpret and navigate complex semantic contexts. To optimize operational efficiency, we introduce a learnable codebook that represents the M‑LLM during inference, significantly reducing computational overhead. Our framework demonstrates its versatility and efficacy through rigorous experimentation, achieving state‑of‑the‑art performance in zero‑shot COS with F_β^w scores of 72.9% on CAMO and 71.7% on COD10K. By removing the M‑LLM during inference, we achieve an inference speed comparable to that of traditional end‑to‑end models, reaching 18.1 FPS. Code: https://github.com/AVC2‑UESTC/ZSCOS‑CaMF

Abstract:
While image‑text representation learning has become very popular in recent years, existing models tend to lack spatial awareness and have limited direct applicability for dense understanding tasks. For this reason, self‑supervised image‑only pretraining is still the go‑to method for many dense vision applications (e.g. depth estimation, semantic segmentation), despite the lack of explicit supervisory signals. In this paper, we close this gap between image‑text and self‑supervised learning, by proposing a novel general‑purpose image‑text model, which can be effectively used off the shelf for dense and global vision tasks. Our method, which we refer to as Text‑Image Pretraining with Spatial awareness (TIPS), leverages two simple and effective insights. First, on textual supervision: we reveal that replacing noisy web image captions by synthetically generated textual descriptions boosts dense understanding performance significantly, due to a much richer signal for learning spatially aware representations. We propose an adapted training method that combines noisy and synthetic captions, resulting in improvements across both dense and global understanding tasks. Second, on the learning technique: we propose to combine contrastive image‑text learning with self‑supervised masked image modeling, to encourage spatial coherence, unlocking substantial enhancements for downstream applications. Building on these two ideas, we scale our model using the transformer architecture, trained on a curated set of public images. Our experiments are conducted on 8 tasks involving 16 datasets in total, demonstrating strong off‑the‑shelf performance on both dense and global understanding, for several image‑only and image‑text tasks. Code and models are released at https://github.com/google‑deepmind/tips.

Abstract:
The Segment Anything Model 2 (SAM 2) has emerged as a powerful foundation model for object segmentation in both images and videos, paving the way for various downstream video applications. The crucial design of SAM 2 for video segmentation is its memory module, which prompts object‑aware memories from previous frames for current frame prediction. However, its greedy‑selection memory design suffers from the "error accumulation" problem, where an errored or missed mask will cascade and influence the segmentation of the subsequent frames, which limits the performance of SAM 2 toward complex long‑term videos. To this end, we introduce SAM2Long, an improved training‑free video object segmentation strategy, which considers the segmentation uncertainty within each frame and chooses the video‑level optimal results from multiple segmentation pathways in a constrained tree search manner. In practice, we maintain a fixed number of segmentation pathways throughout the video. For each frame, multiple masks are proposed based on the existing pathways, creating various candidate branches. We then select the same fixed number of branches with higher cumulative scores as the new pathways for the next frame. After processing the final frame, the pathway with the highest cumulative score is chosen as the final segmentation result. Benefiting from its heuristic search design, SAM2Long is robust toward occlusions and object reappearances, and can effectively segment and track objects for complex long‑term videos. Notably, SAM2Long achieves an average improvement of 3.0 points across all 24 head‑to‑head comparisons, with gains of up to 5.3 points in J&F on long‑term video object segmentation benchmarks such as SA‑V and LVOS. The code is released at https://github.com/Mark12Ding/SAM2Long.
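
The pathway selection described above (keep a fixed number of pathways, branch each one per frame, retain the branches with the highest cumulative scores, and pick the best pathway at the end) can be sketched as a small beam search. The `propose` callback and the scores it returns are hypothetical stand‑ins for the mask proposals and confidence scores of a real segmentation model; this is not the SAM2Long implementation.

```python
import heapq


def constrained_tree_search(num_frames, propose, num_pathways=3):
    """Beam-style search over per-frame candidates.

    `propose(history, frame)` is a hypothetical callback returning a list of
    (candidate, score) pairs for the given frame, conditioned on the pathway's
    history of previously chosen candidates.
    Returns (cumulative_score, chosen_candidates) of the best pathway.
    """
    # Each pathway is (cumulative score, list of chosen candidates).
    pathways = [(0.0, [])]
    for frame in range(num_frames):
        branches = []
        for cum_score, history in pathways:
            for candidate, score in propose(history, frame):
                branches.append((cum_score + score, history + [candidate]))
        if not branches:
            # No proposals for this frame: keep the existing pathways.
            continue
        # Keep only the branches with the highest cumulative scores.
        pathways = heapq.nlargest(num_pathways, branches, key=lambda b: b[0])
    # After the final frame, choose the pathway with the best total score.
    return max(pathways, key=lambda p: p[0])
```

With `num_pathways=1` this degenerates to greedy per‑frame selection, which is the setting in which a single early mistake can propagate to all later frames.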

Abstract:
In this paper, we propose LiOn‑XA, an unsupervised domain adaptation (UDA) approach that combines LiDAR‑Only Cross‑Modal (X) learning with Adversarial training for 3D LiDAR point cloud semantic segmentation to bridge the domain gap arising from environmental and sensor setup changes. Unlike existing works that exploit multiple data modalities like point clouds and RGB image data, we address UDA in scenarios where RGB images might not be available and show that two distinct LiDAR data representations can learn from each other for UDA. More specifically, we leverage 3D voxelized point clouds to preserve important geometric structure in combination with 2D projection‑based range images that provide information such as object orientations or surfaces. To further align the feature space between both domains, we apply adversarial training using both features and predictions of both 2D and 3D neural networks. Our experiments on 3 real‑to‑real adaptation scenarios demonstrate the effectiveness of our approach, achieving new state‑of‑the‑art performance when compared to previous uni‑ and multi‑modal UDA methods. Our source code is publicly available at https://github.com/JensLe97/lion‑xa.

Abstract:
Semantic Scene Completion (SSC) aims to perform geometric completion and semantic segmentation simultaneously. Despite the promising results achieved by existing studies, the inherently ill‑posed nature of the task presents significant challenges in diverse driving scenarios. This paper introduces TALoS, a novel test‑time adaptation approach for SSC that exploits the information available in driving environments. Specifically, we focus on the fact that observations made at a certain moment can serve as Ground Truth (GT) for scene completion at another moment. Given the characteristics of the LiDAR sensor, an observation of an object at a certain location confirms both 1) the occupation of that location and 2) the absence of obstacles along the line of sight from the LiDAR to that point. TALoS utilizes these observations to obtain self‑supervision about occupancy and emptiness, guiding the model to adapt to the scene at test time. In a similar manner, we aggregate reliable SSC predictions among multiple moments and leverage them as semantic pseudo‑GT for adaptation. Further, to leverage future observations that are not accessible at the current time, we present a dual optimization scheme using the model in which the update is delayed until the future observation is available. Evaluations on the SemanticKITTI validation and test sets demonstrate that TALoS significantly improves the performance of the pre‑trained SSC model. Our code is available at https://github.com/blue‑531/TALoS.
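
The two facts a single LiDAR return provides (the hit voxel is occupied, and the voxels along the line of sight are empty) can be turned into voxel labels with a short sketch. The function name, the voxel size, and the half‑voxel sampling step are assumptions made for illustration; sampling along the ray is an approximation that can miss voxels a ray only clips, and an exact grid traversal would be needed in practice. This is not the TALoS code.

```python
import math


def ray_labels(origin, hit, voxel_size=1.0):
    """Label voxels for one LiDAR return.

    Returns a dict mapping integer voxel indices to "free" or "occupied".
    Voxels sampled along the ray are marked free; the voxel containing the
    hit point is marked occupied.
    """
    def to_voxel(point):
        return tuple(math.floor(c / voxel_size) for c in point)

    direction = [h - o for o, h in zip(origin, hit)]
    length = math.sqrt(sum(d * d for d in direction))
    # Sample the ray at half-voxel spacing (approximate traversal).
    steps = int(length / (0.5 * voxel_size))

    labels = {}
    for k in range(steps):
        t = k / steps
        point = [o + t * d for o, d in zip(origin, direction)]
        labels[to_voxel(point)] = "free"
    # The endpoint overrides any "free" label assigned to the same voxel.
    labels[to_voxel(hit)] = "occupied"
    return labels
```

Labels of this kind could serve as sparse occupancy and emptiness targets at test time; how they are combined with semantic pseudo‑labels is described only at a high level in the abstract.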

Abstract:
Empirical studies show that federated learning exhibits convergence issues in non‑Independent and Identically Distributed (non‑IID) setups. However, these studies only focus on label distribution shifts, or concept shifts (e.g. ambiguous tasks). In this paper, we explore for the first time the effect of covariate shifts between participants' data in 2D segmentation tasks, showing an impact on convergence that is far less severe than that of label shifts but still present. Moreover, current Personalized (PFL) and Clustered (CFL) Federated Learning methods intrinsically assume the homogeneity of the dataset of each participant and its consistency with future test samples by operating at the client level. We introduce a more general and realistic framework where each participant owns a mixture of multiple underlying feature domain distributions. To diagnose such pathological feature distributions affecting a model being trained in a federated fashion, we develop Deep Domain Isolation (DDI) to isolate image domains directly in the gradient space of the model. A federated Gaussian Mixture Model is fit to the sample gradients of each class, while the results are combined with spectral clustering on the server side to isolate decentralized sample‑level domains. We leverage this clustering algorithm through a Sample Clustered Federated Learning (SCFL) framework, performing standard federated learning of several independent models, one for each decentralized image domain. Finally, we train a classifier that associates a test sample with its corresponding domain cluster at inference time, offering a final set of models that are agnostic to any assumptions on the test distribution of each participant. We validate our approach on a toy segmentation dataset as well as different partitionings of a combination of Cityscapes and GTA5 datasets using an EfficientVIT‑B0 model, showing a significant performance gain compared to other approaches.
Our code is available at https://github.com/MatthisManthe/DDI_SCFL .

Abstract:
We address the problem of extending the capabilities of vision foundation models such as DINO, SAM, and CLIP, to 3D tasks. Specifically, we introduce a novel method to uplift 2D image features into Gaussian Splatting representations of 3D scenes. Unlike traditional approaches that rely on minimizing a reconstruction loss, our method employs a simpler and more efficient feature aggregation technique, augmented by a graph diffusion mechanism. Graph diffusion refines 3D features, such as coarse segmentation masks, by leveraging 3D geometry and pairwise similarities induced by DINOv2. Our approach achieves performance comparable to the state of the art on multiple downstream tasks while delivering significant speed‑ups. Notably, we obtain competitive segmentation results using only generic DINOv2 features, despite DINOv2 not being trained on millions of annotated segmentation masks like SAM. When applied to CLIP features, our method demonstrates strong performance in open‑vocabulary object segmentation tasks, highlighting the versatility of our approach.

Abstract:
This work proposes a novel approach beyond supervised learning for effective pathological image analysis, addressing the challenge of limited robust labeled data. Pathological diagnosis of diseases like cancer has conventionally relied on the evaluation of morphological features by physicians and pathologists. However, recent advancements in computer‑aided diagnosis (CAD) systems are gaining significant attention as diagnostic support tools. Although the advancement of deep learning has improved CAD significantly, segmentation models typically require large pixel‑level annotated datasets, and such labeling is expensive. Existing studies not based on supervised approaches still struggle with limited generalization, and no practical approach has emerged yet. To address this issue, we present a weakly supervised semantic segmentation (WSSS) model by combining class activation map and Segment Anything Model (SAM)‑based pseudo‑labeling. For effective pretraining, we adopt SAM, a foundation model that is pretrained on large datasets and operates in zero‑shot configurations using only coarse prompts. The proposed approach transfers the knowledge of an enhanced Attention Dropout Layer to SAM, thereby generating pseudo‑labels. To demonstrate the superiority of the proposed method, experimental studies are conducted on histopathological breast cancer datasets. The proposed method outperformed other WSSS methods across three datasets, demonstrating its efficiency by achieving this with only 12GB of GPU memory during training. Our code is available at https://github.com/QI‑NemoSong/EP‑SAM

Abstract:
Semantic segmentation of remote sensing (RS) images is a challenging yet essential task with broad applications. While deep learning, particularly supervised learning with large‑scale labeled datasets, has significantly advanced this field, the acquisition of high‑quality labeled data remains costly and time‑intensive. Unsupervised domain adaptation (UDA) provides a promising alternative by enabling models to learn from unlabeled target domain data while leveraging labeled source domain data. Recent self‑training (ST) approaches employing pseudo‑label generation have shown potential in mitigating domain discrepancies. However, the application of ST to RS image segmentation remains underexplored. Factors such as variations in ground sampling distance, imaging equipment, and geographic diversity exacerbate domain shifts, limiting model performance across domains. As a result, existing ST methods often underperform on cross‑domain RS images. To address these challenges, we propose integrating contrastive learning into UDA, enhancing the model's ability to capture semantic information in the target domain by maximizing the similarity between augmented views of the same image. This additional supervision improves the model's representational capacity and segmentation performance in the target domain. Extensive experiments conducted on RS datasets, including Potsdam, Vaihingen, and LoveDA, demonstrate that our method, SimSeg, outperforms existing approaches, achieving state‑of‑the‑art results. Visualization and quantitative analyses further validate SimSeg's superior ability to learn from the target domain. The code is publicly available at https://github.com/woldier/SiamSeg.

Abstract:
Referring 3D Segmentation is a visual‑language task that segments all points of the specified object from a 3D point cloud described by a query sentence. Previous works follow a two‑stage paradigm, first conducting language‑agnostic instance segmentation and then matching with the given text query. However, the semantic concepts from the text query and the visual cues interact separately during training, and both instance and semantic labels for each object are required, which is time consuming and labor intensive. To mitigate these issues, we propose a novel Referring 3D Segmentation pipeline, Label‑Efficient and Single‑Stage, dubbed LESS, which is supervised only by efficient binary masks. Specifically, we design a Point‑Word Cross‑Modal Alignment module for aligning the fine‑grained features of points and textual embedding. A Query Mask Predictor module and a Query‑Sentence Alignment module are introduced for coarse‑grained alignment between masks and query. Furthermore, we propose an area regularization loss, which coarsely reduces irrelevant background predictions on a large scale. Besides, a point‑to‑point contrastive loss is proposed, concentrating on distinguishing points with subtly similar features. Through extensive experiments, we achieve state‑of‑the‑art performance on the ScanRefer dataset, surpassing previous methods by about 3.7% mIoU using only binary labels. Code is available at https://github.com/mellody11/LESS.

Abstract:
Recent advancements in Vision Language Models (VLMs) have demonstrated remarkable promise in generating visually grounded responses. However, their application in the medical domain is hindered by unique challenges. For instance, most VLMs rely on a single method of visual grounding, whereas complex medical tasks demand more versatile approaches. Additionally, while most VLMs process only 2D images, a large portion of medical images are 3D. The lack of medical data further compounds these obstacles. To address these challenges, we present VividMed, a vision language model with versatile visual grounding for medicine. Our model supports generating both semantic segmentation masks and instance‑level bounding boxes, and accommodates various imaging modalities, including both 2D and 3D data. We design a three‑stage training procedure and an automatic data synthesis pipeline based on open datasets and models. Besides visual grounding tasks, VividMed also excels in other common downstream tasks, including Visual Question Answering (VQA) and report generation. Ablation studies empirically show that the integration of visual grounding ability leads to improved performance on these tasks. Our code is publicly available at https://github.com/function2‑llx/MMMM.

Abstract:
This paper introduces a new learning paradigm termed Neural Metamorphosis (NeuMeta), which aims to build self‑morphable neural networks. Contrary to crafting separate models for different architectures or sizes, NeuMeta directly learns the continuous weight manifold of neural networks. Once trained, we can sample weights for any‑sized network directly from the manifold, even for previously unseen configurations, without retraining. To achieve this ambitious goal, NeuMeta trains neural implicit functions as hypernetworks. They accept coordinates within the model space as input, and generate corresponding weight values on the manifold. In other words, the implicit function is learned in such a way that the predicted weights perform well across various model sizes. In training those models, we notice that the final performance closely relates to the smoothness of the learned manifold. In pursuit of enhancing this smoothness, we employ two strategies. First, we permute weight matrices to achieve intra‑model smoothness, by solving the Shortest Hamiltonian Path problem. Second, we add noise to the input coordinates when training the implicit function, ensuring that models of various sizes show consistent outputs. As such, NeuMeta shows promising results in synthesizing parameters for various network configurations. Our extensive tests in image classification, semantic segmentation, and image generation reveal that NeuMeta sustains full‑size performance even at a 75% compression rate.

Abstract:
Real‑world datasets follow an imbalanced distribution, which poses significant challenges in rare‑category object detection. Recent studies tackle this problem by developing re‑weighting and re‑sampling methods that utilise the class frequencies of the dataset. However, these techniques focus solely on the frequency statistics and ignore the distribution of the classes in image space, missing important information. In contrast, we propose FRActal CALibration (FRACAL): a novel post‑calibration method for long‑tailed object detection. FRACAL devises a logit adjustment method that utilises the fractal dimension to estimate how uniformly classes are distributed in image space. During inference, it uses the fractal dimension to inversely downweight the probabilities of uniformly spaced class predictions, achieving balance in two axes: between frequent and rare categories, and between uniformly spaced and sparsely spaced classes. FRACAL is a post‑processing method that does not require any training, and it can be combined with many off‑the‑shelf models such as one‑stage sigmoid detectors and two‑stage instance segmentation models. FRACAL boosts the rare class performance by up to 8.6% and surpasses all previous methods on the LVIS dataset, while showing good generalisation to other datasets such as COCO, V3Det and OpenImages. We provide the code at https://github.com/kostas1515/FRACAL.
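
As a rough illustration of how a fractal dimension can measure how uniformly a class is spread over the image plane, the sketch below estimates a box‑counting dimension from normalized object centres. The function name, the grid sizes, and the use of object centres are assumptions for illustration; the abstract does not specify FRACAL's estimator or how the value enters the logit adjustment.

```python
import math


def box_counting_dimension(points, grid_sizes=(1, 2, 4, 8, 16)):
    """Estimate the box-counting dimension of 2D points in [0, 1] x [0, 1].

    For each grid size g, the unit square is split into g x g cells and the
    occupied cells are counted. The dimension is the least-squares slope of
    log(occupied cells) against log(g).
    """
    if not points:
        raise ValueError("at least one point is required")

    log_g, log_n = [], []
    for g in grid_sizes:
        # Clamp so that a coordinate of exactly 1.0 falls in the last cell.
        occupied = {(min(int(x * g), g - 1), min(int(y * g), g - 1))
                    for x, y in points}
        log_g.append(math.log(g))
        log_n.append(math.log(len(occupied)))

    mean_g = sum(log_g) / len(log_g)
    mean_n = sum(log_n) / len(log_n)
    numerator = sum((a - mean_g) * (b - mean_n) for a, b in zip(log_g, log_n))
    denominator = sum((a - mean_g) ** 2 for a in log_g)
    return numerator / denominator
```

Under this estimate, centres spread evenly over the whole image give a value near 2, centres along a line give a value near 1, and centres concentrated at one spot give a value near 0.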

Abstract:
Multimodal remote sensing data, acquired from diverse sensors, offer a comprehensive and integrated perspective of the Earth's surface. Leveraging multimodal fusion techniques, semantic segmentation enables detailed and accurate analysis of geographic scenes, surpassing single‑modality approaches. Building on advancements in vision foundation models, particularly the Segment Anything Model (SAM), this study proposes a unified framework incorporating a novel Multimodal Fine‑tuning Network (MFNet) for remote sensing semantic segmentation. The proposed framework is designed to seamlessly integrate with various fine‑tuning mechanisms, demonstrated through the inclusion of Adapter and Low‑Rank Adaptation (LoRA) as representative examples. This extensibility ensures the framework's adaptability to other emerging fine‑tuning strategies, allowing models to retain SAM's general knowledge while effectively leveraging multimodal data. Additionally, a pyramid‑based Deep Fusion Module (DFM) is introduced to integrate high‑level geographic features across multiple scales, enhancing feature representation prior to decoding. This work also highlights SAM's robust generalization capabilities with Digital Surface Model (DSM) data, a novel application. Extensive experiments on three benchmark multimodal remote sensing datasets, ISPRS Vaihingen, ISPRS Potsdam and MMHunan, demonstrate that the proposed MFNet significantly outperforms existing methods in multimodal semantic segmentation, setting a new standard in the field while offering a versatile foundation for future research and applications. The source code for this work is accessible at https://github.com/sstary/SSRS.

Abstract:
Leveraging multiple sensors is crucial for robust semantic perception in autonomous driving, as each sensor type has complementary strengths and weaknesses. However, existing sensor fusion methods often treat sensors uniformly across all conditions, leading to suboptimal performance. By contrast, we propose a novel, condition‑aware multimodal fusion approach for robust semantic perception of driving scenes. Our method, CAFuser, uses an RGB camera input to classify environmental conditions and generate a Condition Token that guides the fusion of multiple sensor modalities. We further introduce modality‑specific feature adapters to align diverse sensor inputs into a shared latent space, enabling efficient integration with a single, shared pre‑trained backbone. By dynamically adapting sensor fusion based on the actual condition, our model significantly improves robustness and accuracy, especially in adverse‑condition scenarios. CAFuser ranks first on the public MUSES benchmarks, achieving 59.7 PQ for multimodal panoptic and 78.2 mIoU for semantic segmentation, and also sets the new state of the art on DeLiVER. The source code is publicly available at: https://github.com/timbroed/CAFuser.

Abstract:
Semi‑supervised semantic segmentation (SSS) aims at learning rich visual knowledge from cheap unlabeled images to enhance semantic segmentation capability. Among recent works, UniMatch improves on its predecessors tremendously by amplifying the practice of weak‑to‑strong consistency regularization. Subsequent works typically follow similar pipelines and propose various delicate designs. Despite the achieved progress, strangely, even in this flourishing era of numerous powerful vision models, almost all SSS works are still sticking to 1) using outdated ResNet encoders with small‑scale ImageNet‑1K pre‑training, and 2) evaluation on simple Pascal and Cityscapes datasets. In this work, we argue that it is necessary to switch the baseline of SSS from ResNet‑based encoders to more capable ViT‑based encoders (e.g., DINOv2) that are pre‑trained on massive data. A simple update on the encoder (even using 2x fewer parameters) can bring more significant improvement than careful method designs. Built on this competitive baseline, we present our upgraded and simplified UniMatch V2, inheriting the core spirit of weak‑to‑strong consistency from V1, but requiring less training cost and providing consistently better results. Additionally, witnessing the gradually saturated performance on Pascal and Cityscapes, we call for a focus on more challenging benchmarks with complex taxonomy, such as ADE20K and COCO datasets. Code, models, and logs of all reported values are available at https://github.com/LiheYoung/UniMatch‑V2.

Abstract:
Semantic segmentation of point clouds is an essential task for understanding the environment in autonomous driving and robotics. Recent range‑based works achieve real‑time efficiency, while point‑ and voxel‑based methods produce better results but are affected by high computational complexity. Moreover, highly complex deep learning models are often not suited to efficiently learn from small datasets. Their generalization capabilities can easily be driven by the abundance of data rather than the architecture design. In this paper, we harness the information from the three‑dimensional representation to proficiently capture local features, while introducing the range image representation to incorporate additional information and facilitate fast computation. A GPU‑based KDTree allows for rapid building, querying, and enhancing projection with straightforward operations. Extensive experiments on SemanticKITTI and nuScenes datasets demonstrate the benefits of our modification in a ``small data'' setup, in which only one sequence of the dataset is used to train the models, but also in the conventional setup, where all sequences except one are used for training. We show that a reduced version of our model not only demonstrates strong competitiveness against full‑scale state‑of‑the‑art models but also operates in real‑time, making it a viable choice for real‑world case applications. The code of our method is available at https://github.com/Bender97/WaffleAndRange.

Abstract:
In the realm of high‑resolution (HR), fine‑grained image segmentation, the primary challenge is balancing broad contextual awareness with the precision required for detailed object delineation, capturing intricate details and the finest edges of objects. Diffusion models, trained on vast datasets comprising billions of image‑text pairs, such as SD V2.1, have revolutionized text‑to‑image synthesis by delivering exceptional quality, fine detail resolution, and strong contextual awareness, making them an attractive solution for high‑resolution image segmentation. To this end, we propose DiffDIS, a diffusion‑driven segmentation model that taps into the potential of the pre‑trained U‑Net within diffusion models, specifically designed for high‑resolution, fine‑grained object segmentation. By leveraging the robust generalization capabilities and rich, versatile image representation prior of the SD models, coupled with a task‑specific stable one‑step denoising approach, we significantly reduce the inference time while preserving high‑fidelity, detailed generation. Additionally, we introduce an auxiliary edge generation task that not only enhances the preservation of fine details at object boundaries but also reconciles the probabilistic nature of diffusion with the deterministic demands of segmentation. With these refined strategies in place, DiffDIS serves as a rapid object mask generation model, specifically optimized for generating detailed binary maps at high resolutions, while demonstrating impressive accuracy and swift processing. Experiments on the DIS5K dataset demonstrate the superiority of DiffDIS, achieving state‑of‑the‑art results through a streamlined inference process. The source code will be publicly available at https://github.com/qianyu‑dlut/DiffDIS.

Abstract:
Robotic perception models often fail when deployed in real‑world environments due to out‑of‑distribution conditions such as clutter, occlusion, and novel object instances. Existing approaches address this gap through offline data collection and retraining, which are slow and do not resolve deployment‑time failures. We propose iTeach, a failure‑driven interactive teaching framework for adapting robot perception in the wild. A co‑located human observes model predictions during deployment, identifies failure cases, and performs a short human‑object interaction (HumanPlay) to expose informative object configurations while recording RGB‑D sequences. To minimize annotation effort, iTeach employs a Few‑Shot Semi‑Supervised (FS3) labeling strategy, where only the final frame of a short interaction sequence is annotated using hands‑free eye‑gaze and voice commands, and labels are propagated across the video to produce dense supervision. The collected failure‑driven samples are used for iterative fine‑tuning, enabling progressive deployment‑time adaptation of the perception model. We evaluate iTeach on unseen object instance segmentation (UOIS) starting from a pretrained MSMFormer model. Using a small number of failure‑driven samples, our method significantly improves segmentation performance across diverse real‑world scenes. These improvements directly translate to higher grasping and pick‑and‑place success on the SceneReplica benchmark and real robotic experiments. Our results demonstrate that failure‑driven, co‑located interactive teaching enables efficient in‑the‑wild adaptation of robot perception and improves downstream manipulation performance. Project page at https://irvlutd.github.io/iTeach

Abstract:
Vision Transformers with various attention modules have demonstrated superior performance on vision tasks. While using sparsity‑adaptive attention, such as in DAT, has yielded strong results in image classification, the key‑value pairs selected by deformable points lack semantic relevance when fine‑tuning for semantic segmentation tasks. The query‑aware sparsity attention in BiFormer seeks to focus each query on top‑k routed regions. However, during attention calculation, the selected key‑value pairs are influenced by too many irrelevant queries, reducing attention on the more important ones. To address these issues, we propose the Deformable Bi‑level Routing Attention (DBRA) module, which optimizes the selection of key‑value pairs using agent queries and enhances the interpretability of queries in attention maps. Based on this, we introduce the Deformable Bi‑level Routing Attention Transformer (DeBiFormer), a novel general‑purpose vision transformer built with the DBRA module. DeBiFormer has been validated on various computer vision tasks, including image classification, object detection, and semantic segmentation, providing strong evidence of its effectiveness. Code is available at https://github.com/maclong01/DeBiFormer

Abstract:
The recent advancements in large‑scale pre‑training techniques have significantly enhanced the capabilities of vision foundation models, notably the Segment Anything Model (SAM), which can generate precise masks based on point and box prompts. Recent studies extend SAM to Few‑shot Semantic Segmentation (FSS), focusing on prompt generation for SAM‑based automatic semantic segmentation. However, these methods struggle with selecting suitable prompts, require specific hyperparameter settings for different scenarios, and experience prolonged one‑shot inference times due to the overuse of SAM, resulting in low efficiency and limited automation ability. To address these issues, we propose a simple yet effective approach based on graph analysis. In particular, a Positive‑Negative Alignment module dynamically selects the point prompts for generating masks, especially uncovering the potential of the background context as the negative reference. A subsequent Point‑Mask Clustering module aligns the granularity of masks and selected points as a directed graph, based on mask coverage over points. These points are then aggregated by decomposing the weakly connected components of the directed graph in an efficient manner, constructing distinct natural clusters. Finally, the positive and overshooting gating, benefiting from graph‑based granularity alignment, aggregates high‑confidence masks and filters out false‑positive masks for the final prediction, reducing the usage of additional hyperparameters and redundant mask generation. Extensive experimental analysis across standard FSS, One‑shot Part Segmentation, and Cross Domain FSS datasets validates the effectiveness and efficiency of the proposed approach, surpassing state‑of‑the‑art generalist models with an mIoU of 58.7% on COCO‑20i and 35.2% on LVIS‑92i. The code is available at https://andyzaq.github.io/GF‑SAM/.
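
Illustrative sketch (not taken from the paper's code): the clustering step above groups points by the weakly connected components of a directed graph. The snippet below shows one way this could look, using union-find and ignoring edge direction. The edge rule, an edge from point i to every other point covered by the mask prompted from point i, and the names `coverage_edges` and `weakly_connected_components` are assumptions made for illustration only.

```python
def coverage_edges(coverage):
    """Build directed edges i -> j when the mask prompted from point i covers point j.

    `coverage[i]` is a set of point indices; this edge rule is an assumption.
    """
    return [(i, j) for i, covered in enumerate(coverage) for j in covered if j != i]


def weakly_connected_components(num_nodes, edges):
    """Group nodes that are connected when edge direction is ignored (union-find)."""
    parent = list(range(num_nodes))

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            # Merge the two sets; the smaller index becomes the root.
            parent[max(ru, rv)] = min(ru, rv)

    groups = {}
    for node in range(num_nodes):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())


# Toy example: five point prompts and the points each of their masks covers.
coverage = [{0, 1}, {1}, {2, 3}, {3}, {4}]
clusters = weakly_connected_components(len(coverage), coverage_edges(coverage))
print(clusters)
```

Each resulting cluster is a set of points that could share one mask granularity; point 4 stays alone because no other mask covers it.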

Abstract:
We address the challenges of the semi‑supervised LiDAR segmentation (SSLS) problem, particularly in low‑budget scenarios. The two main issues in low‑budget SSLS are the poor‑quality pseudo‑labels for unlabeled data, and the performance drops due to the significant imbalance between ground‑truth and pseudo‑labels. This imbalance leads to a vicious training cycle. To overcome these challenges, we leverage the spatio‑temporal prior by recognizing the substantial overlap between temporally adjacent LiDAR scans. We propose a proximity‑based label estimation, which generates highly accurate pseudo‑labels for unlabeled data by utilizing semantic consistency with adjacent labeled data. Additionally, we enhance this method by progressively expanding the pseudo‑labels from the nearest unlabeled scans, which helps significantly reduce errors linked to dynamic classes. We also employ a dual‑branch structure to mitigate performance degradation caused by data imbalance. Experimental results demonstrate remarkable performance in low‑budget settings (i.e., <= 5%) and meaningful improvements in normal budget settings (i.e., 5‑50%). Finally, our method has achieved new state‑of‑the‑art results on SemanticKITTI and nuScenes in semi‑supervised LiDAR segmentation. With only 5% labeled data, it offers competitive results against fully‑supervised counterparts. Moreover, it surpasses the performance of the previous state‑of‑the‑art at 100% labeled data (75.2%) using only 20% of labeled data (76.0%) on nuScenes. The code is available at https://github.com/halbielee/PLE.
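
Illustrative sketch (not the authors' implementation): the idea of estimating pseudo-labels from a temporally adjacent labeled scan can be pictured as a nearest-neighbor label transfer. The snippet assumes both scans are already expressed in a common coordinate frame. The distance threshold `max_dist`, the `ignore` label, and the function name `propagate_labels` are hypothetical choices for this example.

```python
import math


def propagate_labels(labeled_points, labels, unlabeled_points, max_dist=0.5, ignore=-1):
    """Give each unlabeled point the label of its nearest labeled point.

    Points farther than `max_dist` from every labeled point receive `ignore`,
    so that uncertain regions (for example moving objects) stay unlabeled.
    Brute-force search is used for clarity; a KD-tree would be used in practice.
    """
    pseudo_labels = []
    for query in unlabeled_points:
        best_dist, best_label = float("inf"), ignore
        for point, label in zip(labeled_points, labels):
            dist = math.dist(point, query)
            if dist < best_dist:
                best_dist, best_label = dist, label
        pseudo_labels.append(best_label if best_dist <= max_dist else ignore)
    return pseudo_labels


# Toy example: two labeled points, three query points from an adjacent scan.
labeled = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
labels = [1, 2]
unlabeled = [(0.1, 0.0, 0.0), (4.9, 0.0, 0.0), (2.5, 0.0, 0.0)]
pseudo = propagate_labels(labeled, labels, unlabeled)
print(pseudo)
```

The third query point is too far from any labeled point and is left unlabeled, which mirrors the goal of keeping only high-confidence pseudo-labels.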

Abstract:
Visible and Infrared Image Fusion (VIF) has garnered significant interest across a wide range of high‑level vision tasks, such as object detection and semantic segmentation. However, the evaluation of VIF methods remains challenging due to the absence of ground truth. This paper proposes a Segmentation‑oriented Evaluation Approach (SEA) to assess VIF methods by incorporating the semantic segmentation task and leveraging segmentation labels available in the latest VIF datasets. Specifically, SEA utilizes universal segmentation models, capable of handling diverse images and classes, to predict segmentation outputs from fused images and compare these outputs with segmentation labels. Our evaluation of recent VIF methods using SEA reveals that their performance is comparable or even inferior to using visible images only, despite nearly half of the infrared images demonstrating better performance than visible images. Further analysis indicates that the two metrics most correlated with our SEA are the gradient‑based fusion metric Q_ABF and the visual information fidelity metric Q_VIFF among conventional VIF evaluation metrics, which can serve as proxies when segmentation labels are unavailable. We hope that our evaluation will guide the development of novel and practical VIF methods. The code has been released at https://github.com/Yixuan‑2002/SEA/.

Abstract:
Recent advancements in State Space Models, notably Mamba, have demonstrated superior performance over the dominant Transformer models, particularly in reducing the computational complexity from quadratic to linear. Yet, difficulties in adapting Mamba from language to vision tasks arise due to the distinct characteristics of visual data, such as the spatial locality and adjacency within images and large variations in information granularity across visual tokens. Existing vision Mamba approaches either flatten tokens into sequences in a raster scan fashion, which breaks the local adjacency of images, or manually partition tokens into windows, which limits their long‑range modeling and generalization capabilities. To address these limitations, we present a new vision Mamba model, coined QuadMamba, that effectively captures local dependencies of varying granularities via quadtree‑based image partition and scan. Concretely, our lightweight quadtree‑based scan module learns to preserve the 2D locality of spatial regions within learned window quadrants. The module estimates the locality score of each token from their features, before adaptively partitioning tokens into window quadrants. An omnidirectional window shifting scheme is also introduced to capture more intact and informative features across different local regions. To make the discretized quadtree partition end‑to‑end trainable, we further devise a sequence masking strategy based on Gumbel‑Softmax and its straight‑through gradient estimator. Extensive experiments demonstrate that QuadMamba achieves state‑of‑the‑art performance in various vision tasks, including image classification, object detection, instance segmentation, and semantic segmentation. The code is in https://github.com/VISION‑SJTU/QuadMamba.

Abstract:
Recent approaches attempt to adapt powerful interactive segmentation models, such as SAM, to interactive matting and fine‑tune the models based on synthetic matting datasets. However, models trained on synthetic data fail to generalize to complex and occluded scenes. We address this challenge by proposing a new matting dataset based on the COCO dataset, namely COCO‑Matting. Specifically, the construction of our COCO‑Matting includes accessory fusion and mask‑to‑matte, which selects real‑world complex images from COCO and converts semantic segmentation masks to matting labels. The built COCO‑Matting comprises an extensive collection of 38,251 human instance‑level alpha mattes in complex natural scenarios. Furthermore, existing SAM‑based matting methods extract intermediate features and masks from a frozen SAM and only train a lightweight matting decoder by end‑to‑end matting losses, which do not fully exploit the potential of the pre‑trained SAM. Thus, we propose SEMat, which revamps the network architecture and training objectives. For network architecture, the proposed feature‑aligned transformer learns to extract fine‑grained edge and transparency features. The proposed matte‑aligned decoder aims to segment matting‑specific objects and convert coarse masks into high‑precision mattes. For training objectives, the proposed regularization and trimap loss aim to retain the prior from the pre‑trained model and push the matting logits extracted from the mask decoder to contain trimap‑based semantic information. Extensive experiments across seven diverse datasets demonstrate the superior performance of our method, proving its efficacy in interactive natural image matting. We open‑source our code, models, and dataset at https://github.com/XiaRho/SEMat.

Abstract:
Recent approaches in remote sensing have increasingly focused on multimodal data, driven by the growing availability of diverse earth observation datasets. Integrating complementary information from different modalities has shown substantial potential in enhancing semantic understanding. However, existing global multimodal datasets often lack the inclusion of Synthetic Aperture Radar (SAR) data, which excels at capturing texture and structural details. SAR, as a complementary perspective to other modalities, facilitates the utilization of spatial information for global land use and land cover (LULC). To address this gap, we introduce the Dynamic World+ dataset, expanding the current authoritative multispectral dataset, Dynamic World, with aligned SAR data. Additionally, to facilitate the combination of multispectral and SAR data, we propose a lightweight transformer architecture termed SpecSAR‑Former. It incorporates two innovative modules, Dual Modal Enhancement Module (DMEM) and Mutual Modal Aggregation Module (MMAM), designed to exploit cross‑information between the two modalities in a split‑fusion manner. These modules enhance the model's ability to integrate spectral and spatial information, thereby improving the overall performance of global LULC semantic segmentation. Furthermore, we adopt an imbalanced parameter allocation strategy that assigns parameters to different modalities based on their importance and information density. Extensive experiments demonstrate that our network outperforms existing transformer and CNN‑based models, achieving a mean Intersection over Union (mIoU) of 59.58%, an Overall Accuracy (OA) of 79.48%, and an F1 Score of 71.68% with only 26.70M parameters. The code will be available at https://github.com/Reagan1311/LULC_segmentation.

Abstract:
Diffusion models are initially designed for image generation. Recent research shows that the internal signals within their backbones, named activations, can also serve as dense features for various discriminative tasks such as semantic segmentation. Given numerous activations, selecting a small yet effective subset poses a fundamental problem. To this end, the early study of this field performs a large‑scale quantitative comparison of the discriminative ability of the activations. However, we find that many potential activations have not been evaluated, such as the queries and keys used to compute attention scores. Moreover, recent advancements in diffusion architectures bring many new activations, such as those within embedded ViT modules. Both combined, activation selection remains unresolved but overlooked. To tackle this issue, this paper takes a further step with a much broader range of activations evaluated. Considering the significant increase in activations, a full‑scale quantitative comparison is no longer operational. Instead, we seek to understand the properties of these activations, such that the activations that are clearly inferior can be filtered out in advance via simple qualitative evaluation. After careful analysis, we discover three properties universal among diffusion models, enabling this study to go beyond specific models. On top of this, we present effective feature selection solutions for several popular diffusion models. Finally, the experiments across multiple discriminative tasks validate the superiority of our method over the SOTA competitors. Our code is available at https://github.com/Darkbblue/generic‑diffusion‑feature.

Abstract:
We developed a software pipeline for quality control (QC) of histopathology whole slide images (WSIs) that segments various regions, such as blurs of different levels, tissue regions, tissue folds, and pen marks. Given the necessity and increasing availability of GPUs for processing WSIs, the proposed pipeline comprises multiple lightweight deep learning models to strike a balance between accuracy and speed. The pipeline was evaluated on all of TCGA, which is the largest publicly available WSI dataset containing more than 11,000 histopathological images from 28 organs. It was compared to a previous work, which was not based on deep learning, and it showed consistent improvement in segmentation results across organs. To minimize annotation effort for tissue and blur segmentation, annotated images were automatically prepared by mosaicking patches (sub‑images) from various WSIs whose labels were identified using a patch classification tool, HistoROI. Due to the generality of our trained QC pipeline and its extensive testing, the potential impact of this work is broad. It can be used for automated pre‑processing of any WSI cohort to enhance the accuracy and reliability of large‑scale histopathology image analysis for both research and clinical use. We have made the trained models, training scripts, training data, and inference results publicly available at https://github.com/abhijeetptl5/wsisegqc, which should enable the research community to use the pipeline right out of the box or further customize it to new datasets and applications in the future.

Abstract:
Capturing long‑range dependencies while preserving high‑resolution visual representations is crucial for dense prediction tasks such as human pose estimation. Vision Transformers (ViTs) have advanced global modeling through self‑attention but suffer from quadratic computational complexity with respect to token count, limiting their efficiency and scalability to high‑resolution inputs, especially on mobile and resource‑constrained devices. State Space Models (SSMs), exemplified by Mamba, offer an efficient alternative by combining global receptive fields with linear computational complexity, enabling scalable and resource‑friendly sequence modeling. However, when applied to dense prediction tasks, existing visual SSMs face key limitations: weak spatial inductive bias, long‑range forgetting from hidden state decay, and low‑resolution outputs that hinder fine‑grained localization. To address these issues, we propose the Dynamic Visual State Space (DVSS) block, which augments visual state space models with multi‑scale convolutional operations to enhance local spatial representations and strengthen spatial inductive biases. Through architectural exploration and theoretical analysis, we incorporate deformable operation into the DVSS block, identifying it as an efficient and effective mechanism to enhance semantic aggregation and mitigate long‑range forgetting via input‑dependent, adaptive spatial sampling. We embed DVSS into a multi‑branch high‑resolution architecture to build HRVMamba, a novel model for efficient high‑resolution representation learning. Extensive experiments on human pose estimation, image classification, and semantic segmentation show that HRVMamba performs competitively against leading CNN‑, ViT‑, and SSM‑based baselines. Code is available at https://github.com/zhanghao5201/PoseVMamba.

Abstract:
Contrastive learning has become a dominant approach in self‑supervised visual representation learning, but efficiently leveraging hard negatives, which are samples closely resembling the anchor, remains challenging. We introduce SynCo (Synthetic negatives in Contrastive learning), a novel approach that improves model performance by generating synthetic hard negatives on the representation space. Building on the MoCo framework, SynCo introduces six strategies for creating diverse synthetic hard negatives on‑the‑fly with minimal computational overhead. SynCo achieves faster training and strong representation learning, surpassing MoCo‑v2 by +0.4% and MoCHI by +1.0% on ImageNet ILSVRC‑2012 linear evaluation. It also transfers more effectively to detection tasks achieving strong results on PASCAL VOC detection (57.2% AP) and significantly improving over MoCo‑v2 on COCO detection (+1.0% AP) and instance segmentation (+0.8% AP). Our synthetic hard negative generation approach significantly enhances visual representations learned through self‑supervised contrastive learning.

Abstract:
The Diffusion Model has not only garnered noteworthy achievements in the realm of image generation but has also demonstrated its potential as an effective pretraining method utilizing unlabeled data. Drawing from the extensive potential unveiled by the Diffusion Model in both semantic correspondence and open vocabulary segmentation, our work initiates an investigation into employing the Latent Diffusion Model for Few‑shot Semantic Segmentation. Recently, inspired by the in‑context learning ability of large language models, Few‑shot Semantic Segmentation has evolved into In‑context Segmentation tasks, morphing into a crucial element in assessing generalist segmentation models. In this context, we concentrate on Few‑shot Semantic Segmentation, establishing a solid foundation for the future development of a Diffusion‑based generalist model for segmentation. Our initial focus lies in understanding how to facilitate interaction between the query image and the support image, resulting in the proposal of a KV fusion method within the self‑attention framework. Subsequently, we delve deeper into optimizing the infusion of information from the support mask and simultaneously re‑evaluating how to provide reasonable supervision from the query mask. Based on our analysis, we establish a simple and effective framework named DiffewS, maximally retaining the original Latent Diffusion Model's generative framework and effectively utilizing the pre‑training prior. Experimental results demonstrate that our method significantly outperforms the previous SOTA models in multiple settings.

Abstract:
3D scene understanding is crucial for facilitating seamless interaction between digital devices and the physical world. Real‑time capturing and processing of the 3D scene are essential for achieving this seamless integration. While existing approaches typically separate acquisition and processing for each frame, the advent of resolution‑scalable 3D sensors offers an opportunity to overcome this paradigm and fully leverage the otherwise wasted acquisition time to initiate processing. In this study, we introduce VX‑S3DIS, a novel point cloud dataset accurately simulating the behavior of a resolution‑scalable 3D sensor. Additionally, we present RESSCAL3D++, an important improvement over our prior work, RESSCAL3D, by incorporating an update module and processing strategy. By applying our method to the new dataset, we practically demonstrate the potential of joint acquisition and semantic segmentation of 3D point clouds. Our resolution‑scalable approach significantly reduces scalability costs from 2% to just 0.2% in mIoU while achieving impressive speed‑ups of 15.6 to 63.9% compared to the non‑scalable baseline. Furthermore, our scalable approach enables early predictions, with the first one occurring after only 7% of the total inference time of the baseline. The new VX‑S3DIS dataset is available at https://github.com/remcoroyen/vx‑s3dis.

Abstract:
Remote sensing imagery plays an irreplaceable role in fields such as agriculture, water resources, military, and disaster relief. Pixel‑level interpretation is a critical aspect of remote sensing image applications; however, a prevalent limitation remains the need for extensive manual annotation. To this end, we introduce open‑vocabulary semantic segmentation (OVSS) into the remote sensing context. However, due to the sensitivity of remote sensing images to low‑resolution features, distorted target shapes and ill‑fitting boundaries are exhibited in the prediction mask. To tackle this issue, we propose a simple and general upsampler, SimFeatUp, to restore lost spatial information in deep features in a training‑free style. Further, based on the observation of the abnormal response of local patch tokens to the [CLS] token in CLIP, we propose to execute a straightforward subtraction operation to alleviate the global bias in patch tokens. Extensive experiments are conducted on 17 remote sensing datasets spanning semantic segmentation, building extraction, road detection, and flood detection tasks. Our method achieves an average of 5.8%, 8.2%, 4.0%, and 15.3% improvement over state‑of‑the‑art methods on the 4 tasks, respectively. All code is released at https://earth‑insights.github.io/SegEarth‑OV
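
Illustrative sketch (an assumption about the general form, not the released code): the subtraction operation mentioned above can be pictured as removing a scaled copy of the global [CLS] embedding from every patch embedding. The scaling factor `weight` and the function name `debias_patch_tokens` are hypothetical.

```python
def debias_patch_tokens(patch_tokens, cls_token, weight=1.0):
    """Subtract a scaled global [CLS] embedding from each patch embedding.

    `patch_tokens` is a list of equal-length float lists (one per patch),
    `cls_token` is a float list of the same length. `weight` controls how
    much of the global component is removed and is an assumed parameter.
    """
    return [
        [p - weight * c for p, c in zip(token, cls_token)]
        for token in patch_tokens
    ]


# Toy example with two 2-dimensional patch embeddings.
patches = [[1.0, 2.0], [3.0, 4.0]]
cls = [0.5, 1.0]
debiased = debias_patch_tokens(patches, cls)
print(debiased)
```

After the subtraction, the component shared by all patches is reduced, so the remaining patch features are more local in nature.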

Abstract:
In this paper, we explore a novel Text‑supervised Egocentric Semantic Segmentation (TESS) task that aims to assign pixel‑level categories to egocentric images weakly supervised by texts from image‑level labels. In this task with prospective potential, the egocentric scenes contain dense wearer‑object relations and inter‑object interference. However, most recent third‑view methods leverage the frozen Contrastive Language‑Image Pre‑training (CLIP) model, which is pre‑trained on the semantic‑oriented third‑view data and lapses in the egocentric view due to the "relation insensitive" problem. Hence, we propose a Cognition Transferring and Decoupling Network (CTDN) that first learns the egocentric wearer‑object relations via correlating the image and text. Besides, a Cognition Transferring Module (CTM) is developed to distill the cognitive knowledge from the large‑scale pre‑trained model to our model for recognizing egocentric objects with various semantics. Based on the transferred cognition, the Foreground‑background Decoupling Module (FDM) disentangles the visual representations to explicitly discriminate the foreground and background regions to mitigate false activation areas caused by foreground‑background interferential objects during egocentric relation learning. Extensive experiments on four TESS benchmarks demonstrate the effectiveness of our approach, which outperforms many recent related methods by a large margin. Code will be available at https://github.com/ZhaofengSHI/CTDN.

Abstract:
Subsampling layers play a crucial role in deep nets by discarding a portion of an activation map to reduce its spatial dimensions. This encourages the deep net to learn higher‑level representations. Contrary to this motivation, we hypothesize that the discarded activations are useful and can be incorporated on the fly to improve models' prediction. To validate our hypothesis, we propose a search and aggregate method to find useful activation maps to be used at test time. We applied our approach to the task of image classification and semantic segmentation. Extensive experiments over nine different architectures on multiple datasets show that our method consistently improves model test‑time performance, complementing existing test‑time augmentation techniques. Our code is available at https://github.com/ca‑joe‑yang/discard‑in‑subsampling.
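
Illustrative sketch (not the paper's search procedure): a stride-s subsampling layer keeps one of s × s possible sampling grids and discards the others. The snippet below enumerates all of these candidate maps for a small 2D activation map. The function names `subsample` and `all_subsampling_candidates` are hypothetical, and how the candidates are searched and aggregated at test time is not shown here.

```python
def subsample(grid, stride, row_offset, col_offset):
    """Keep every `stride`-th activation, starting at the given offsets."""
    return [row[col_offset::stride] for row in grid[row_offset::stride]]


def all_subsampling_candidates(grid, stride):
    """Return every candidate subsampled map, keyed by its (row, col) offset.

    The (0, 0) entry is what a standard subsampling layer keeps; the other
    entries correspond to activations that are normally discarded.
    """
    return {
        (r, c): subsample(grid, stride, r, c)
        for r in range(stride)
        for c in range(stride)
    }


# Toy 4x4 activation map with values 0..15.
grid = [[4 * r + c for c in range(4)] for r in range(4)]
candidates = all_subsampling_candidates(grid, stride=2)
kept = candidates[(0, 0)]
print(kept)
print(candidates[(1, 1)])
```

With stride 2, three of the four candidate maps are thrown away by the standard layer, which is the information the abstract proposes to reuse at test time.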

Abstract:
The Area Under the ROC Curve (AUC) is a well‑known metric for evaluating instance‑level long‑tail learning problems. In the past two decades, many AUC optimization methods have been proposed to improve model performance under long‑tail distributions. In this paper, we explore AUC optimization methods in the context of pixel‑level long‑tail semantic segmentation, a much more complicated scenario. This task introduces two major challenges for AUC optimization techniques. On one hand, AUC optimization in a pixel‑level task involves complex coupling across loss terms, with structured inner‑image and pairwise inter‑image dependencies, complicating theoretical analysis. On the other hand, we find that mini‑batch estimation of AUC loss in this case requires a larger batch size, resulting in an unaffordable space complexity. To address these issues, we develop a pixel‑level AUC loss function and conduct a dependency‑graph‑based theoretical analysis of the algorithm's generalization ability. Additionally, we design a Tail‑Classes Memory Bank (T‑Memory Bank) to manage the significant memory demand. Finally, comprehensive experiments across various benchmarks confirm the effectiveness of our proposed AUCSeg method. The code is available at https://github.com/boyuh/AUCSeg.
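
Illustrative sketch (a generic textbook formulation, not the AUCSeg loss): AUC can be estimated as the fraction of positive-negative score pairs that are ranked correctly, and a common differentiable surrogate replaces the 0/1 ranking indicator with a square loss on the score difference. The function names are hypothetical. The pairwise structure also shows why memory grows quickly when every pixel becomes a sample.

```python
def empirical_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))


def pairwise_square_loss(pos_scores, neg_scores):
    """Square surrogate: penalize pairs whose score margin (p - n) is below 1."""
    total = sum(
        (1.0 - (p - n)) ** 2
        for p in pos_scores
        for n in neg_scores
    )
    return total / (len(pos_scores) * len(neg_scores))


# Toy example: scores for pixels of one class (positives) versus the rest.
pos = [0.9, 0.4]
neg = [0.5, 0.1]
auc = empirical_auc(pos, neg)
loss = pairwise_square_loss(pos, neg)
print(auc, loss)
```

Both functions iterate over all positive-negative pairs, so the cost is proportional to the product of the two set sizes.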

Abstract:
Large‑scale vision‑language models like CLIP have demonstrated impressive open‑vocabulary capabilities for image‑level tasks, excelling in recognizing what objects are present. However, they struggle with pixel‑level recognition tasks like semantic segmentation, which additionally require understanding where the objects are located. In this work, we propose a novel method, PixelCLIP, to adapt the CLIP image encoder for pixel‑level understanding by guiding the model on where, which is achieved using unlabeled images and masks generated from vision foundation models such as SAM and DINO. To address the challenges of leveraging masks without semantic labels, we devise an online clustering algorithm using learnable class names to acquire general semantic concepts. PixelCLIP shows significant performance improvements over CLIP and competitive results compared to caption‑supervised methods in open‑vocabulary semantic segmentation. Project page is available at https://cvlab‑kaist.github.io/PixelCLIP

Abstract:
We introduce VideoLISA, a video‑based multimodal large language model designed to tackle the problem of language‑instructed reasoning segmentation in videos. Leveraging the reasoning capabilities and world knowledge of large language models, and augmented by the Segment Anything Model, VideoLISA generates temporally consistent segmentation masks in videos based on language instructions. Existing image‑based methods, such as LISA, struggle with video tasks due to the additional temporal dimension, which requires temporal dynamic understanding and consistent segmentation across frames. VideoLISA addresses these challenges by integrating a Sparse Dense Sampling strategy into the video‑LLM, which balances temporal context and spatial detail within computational constraints. Additionally, we propose a One‑Token‑Seg‑All approach using a specially designed <TRK> token, enabling the model to segment and track objects across multiple frames. Extensive evaluations on diverse benchmarks, including our newly introduced ReasonVOS benchmark, demonstrate VideoLISA's superior performance in video object segmentation tasks involving complex reasoning, temporal understanding, and object tracking. While optimized for videos, VideoLISA also shows promising generalization to image segmentation, revealing its potential as a unified foundation model for language‑instructed object segmentation. Code and model will be available at: https://github.com/showlab/VideoLISA.

Abstract:
Multi‑modal Video Object Segmentation (VOS), including RGB‑Thermal, RGB‑Depth, and RGB‑Event, has garnered attention due to its capability to address challenging scenarios where traditional VOS methods struggle, such as extreme illumination, rapid motion, and background distraction. Existing approaches often involve designing specific additional branches and performing full‑parameter fine‑tuning for fusion in each task. However, this paradigm not only duplicates research efforts and hardware costs but also risks model collapse with the limited multi‑modal annotated data. In this paper, we propose a universal framework named X‑Prompt for all multi‑modal video object segmentation tasks, designated as RGB+X. The X‑Prompt framework first pre‑trains a video object segmentation foundation model using RGB data, and then utilizes the additional modality as a prompt to adapt it to downstream multi‑modal tasks with limited data. Within the X‑Prompt framework, we introduce the Multi‑modal Visual Prompter (MVP), which allows prompting the foundation model with the various modalities to segment objects precisely. We further propose the Multi‑modal Adaptation Experts (MAEs) to adapt the foundation model with pluggable modality‑specific knowledge without compromising the generalization capacity. To evaluate the effectiveness of the X‑Prompt framework, we conduct extensive experiments on 3 tasks across 4 benchmarks. The proposed universal X‑Prompt framework consistently outperforms the full fine‑tuning paradigm and achieves state‑of‑the‑art performance. Code: https://github.com/PinxueGuo/X‑Prompt.git

Abstract:
The creation of digital replicas of physical objects has valuable applications for the preservation and dissemination of tangible cultural heritage. However, existing methods are often slow, expensive, and require expert knowledge. We propose a pipeline to generate a 3D replica of a scene using only RGB images (e.g. photos of a museum) and then extract a model for each item of interest (e.g. pieces in the exhibit). We do this by leveraging the advancements in novel view synthesis and Gaussian Splatting, modified to enable efficient 3D segmentation. This approach does not need manual annotation, and the visual inputs can be captured using a standard smartphone, making it both affordable and easy to deploy. We provide an overview of the method and baseline evaluation of the accuracy of object segmentation. The code is available at https://mahtaabdn.github.io/gaussian_heritage.github.io/.

Abstract:
This study investigates the application and performance of the Segment Anything Model 2 (SAM2) in the challenging task of video camouflaged object segmentation (VCOS). VCOS involves detecting objects in videos that blend seamlessly into their surroundings due to similar colors and textures, poor light conditions, etc. Compared to the objects in normal scenes, camouflaged objects are much more difficult to detect. SAM2, a video foundation model, has shown potential in various tasks, but its effectiveness in dynamic camouflaged scenarios remains under‑explored. We present a comprehensive study of SAM2's ability in VCOS. First, we assess SAM2's performance on camouflaged video datasets using different models and prompts (click, box, and mask). Second, we explore the integration of SAM2 with existing multimodal large language models (MLLMs) and VCOS methods. Third, we specifically adapt SAM2 by fine‑tuning it on the video camouflaged dataset. Our comprehensive experiments demonstrate that SAM2 has excellent zero‑shot ability in detecting camouflaged objects in videos. We also show that this ability could be further improved by specifically adjusting SAM2's parameters for VCOS. The code is available at https://github.com/zhoustan/SAM2‑VCOS

Abstract:
Domain adaptation aims to reduce the model degradation on the target domain caused by the domain shift between the source and target domains. Although encouraging performance has been achieved by combining cognitive learning with the self‑training paradigm, such methods suffer from ambiguous scenarios caused by scale, illumination, or overlapping when deploying deterministic embedding. To address these issues, we propose probabilistic proto‑typical pixel contrast (PPPC), a universal adaptation framework that models each pixel embedding as a probability via a multivariate Gaussian distribution to fully exploit the uncertainty within them, eventually improving the representation quality of the model. In addition, we derive prototypes from posterior probability estimation, which helps to push the decision boundary away from the ambiguity points. Moreover, we employ an efficient method to compute similarity between distributions, eliminating the need for sampling and reparameterization, thereby significantly reducing computational overhead. Further, we dynamically select the ambiguous crops at the image level to enlarge the number of boundary points involved in contrastive learning, which benefits the establishment of precise distributions for each category. Extensive experimentation demonstrates that PPPC not only helps to address ambiguity at the pixel level, yielding discriminative representations, but also achieves significant improvements in both synthetic‑to‑real and day‑to‑night adaptation tasks. It surpasses the previous state‑of‑the‑art (SOTA) by +5.2% mIoU in the most challenging daytime‑to‑nighttime adaptation scenario, exhibiting stronger generalization on other unseen datasets. The code and models are available at https://github.com/DarlingInTheSV/Probabilistic‑Prototypical‑Pixel‑Contrast.
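
To illustrate how a similarity between two Gaussian pixel embeddings can be computed without sampling or reparameterization, here is a minimal sketch. It uses the closed-form mutual likelihood score for diagonal Gaussians; this choice is an assumption for illustration, since the abstract does not say which similarity PPPC uses. The function name and arguments are hypothetical.

```python
import math


def mutual_likelihood_score(mu_a, var_a, mu_b, var_b):
    """Closed-form log of the integral of the product of two diagonal Gaussians.

    Assumption: this is one possible sampling-free similarity, not
    necessarily the one used in the paper. Inputs are equal-length
    sequences of per-dimension means and (positive) variances.
    """
    if not (len(mu_a) == len(var_a) == len(mu_b) == len(var_b)):
        raise ValueError("all inputs must have the same length")
    score = 0.0
    for ma, va, mb, vb in zip(mu_a, var_a, mu_b, var_b):
        s = va + vb  # variance of the difference of the two embeddings
        score -= 0.5 * ((ma - mb) ** 2 / s + math.log(s) + math.log(2.0 * math.pi))
    return score


if __name__ == "__main__":
    near = mutual_likelihood_score([0.0, 0.0], [0.5, 0.5], [0.1, 0.0], [0.5, 0.5])
    far = mutual_likelihood_score([0.0, 0.0], [0.5, 0.5], [3.0, 0.0], [0.5, 0.5])
    print(near > far)  # closer means score higher at equal variances
```

The score depends on both the distance between the means and the summed variances, so an uncertain embedding is penalized less for being far from a prototype than a confident one.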

Abstract:
Amodal Instance Segmentation (AIS) presents an intriguing challenge, including the segmentation prediction of both visible and occluded parts of objects within images. Previous methods have often relied on shape prior information gleaned from training data to enhance amodal segmentation. However, these approaches are susceptible to overfitting and disregard object category details. Recent advancements highlight the potential of conditioned diffusion models, pretrained on extensive datasets, to generate images from latent space. Drawing inspiration from this, we propose AISDiff with a Diffusion Shape Prior Estimation (DiffSP) module. AISDiff begins with the prediction of the visible segmentation mask and object category, alongside occlusion‑aware processing through the prediction of occluding masks. Subsequently, these elements are inputted into our DiffSP module to infer the shape prior of the object. DiffSP utilizes conditioned diffusion models pretrained on extensive datasets to extract rich visual features for shape prior estimation. Additionally, we introduce the Shape Prior Amodal Predictor, which utilizes attention‑based feature maps from the shape prior to refine amodal segmentation. Experiments across various AIS benchmarks demonstrate the effectiveness of our AISDiff.

Abstract:
We propose a framework for active mapping and exploration that leverages Gaussian splatting for constructing dense maps. Further, we develop a GPU‑accelerated motion planning algorithm that can exploit the Gaussian map for real‑time navigation. The Gaussian map constructed onboard the robot is optimized for both photometric and geometric quality while enabling real‑time situational awareness for autonomy. We show through viewpoint selection experiments that our method yields comparable Peak Signal‑to‑Noise Ratio (PSNR) and similar reconstruction error to state‑of‑the‑art approaches, while being orders of magnitude faster to compute. In closed‑loop physics‑based simulation and real‑world experiments, our algorithm achieves better map quality (at least 0.8dB higher PSNR and more than 16% higher geometric reconstruction accuracy) than maps constructed by a state‑of‑the‑art method, enabling semantic segmentation using off‑the‑shelf open‑set models. Experiment videos and more details can be found on our project page: https://tyuezhan.github.io/RT GuIDE/
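
For readers unfamiliar with the metric quoted above, the following sketch computes PSNR from its standard definition, 10 * log10(MAX^2 / MSE). The function name, the flat-list image representation, and the default peak value of 1.0 are assumptions for illustration and are not taken from the paper's code.

```python
import math


def psnr(reference, estimate, max_value=1.0):
    """Peak Signal-to-Noise Ratio in dB between two equal-length pixel lists."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    print(psnr([0.0, 0.0, 0.0, 0.0], [0.1, 0.1, 0.1, 0.1]))
```

Under this definition, a 0.8 dB gain corresponds to roughly 17% lower mean squared error, since 10^(-0.08) is about 0.83.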

Abstract:
In this report, we present the first place solution to the ECCV 2024 BRAVO Challenge, where a model is trained on Cityscapes and its robustness is evaluated on several out‑of‑distribution datasets. Our solution leverages the powerful representations learned by vision foundation models, by attaching a simple segmentation decoder to DINOv2 and fine‑tuning the entire model. This approach outperforms more complex existing approaches, and achieves first place in the challenge. Our code is publicly available at https://github.com/tue‑mps/benchmark‑vfm‑ss.

Abstract:
Robotic waste sorting poses significant challenges in both perception and manipulation, given the extreme variability of objects that should be recognized on a cluttered conveyor belt. While deep learning has proven effective in solving complex tasks, the necessity for extensive data collection and labeling limits its applicability in real‑world scenarios like waste sorting. To tackle this issue, we introduce a data augmentation method based on a novel GAN architecture called wasteGAN. The proposed method increases the performance of semantic segmentation models starting from a very limited set of labeled examples, as few as 100. The key innovations of wasteGAN include a novel loss function, a novel activation function, and a larger generator block. Overall, these innovations help the network learn from a limited number of examples and synthesize data that better mirrors real‑world distributions. We then leverage the higher‑quality segmentation masks predicted from models trained on the wasteGAN synthetic data to compute semantic‑aware grasp poses, enabling a robotic arm to effectively recognize contaminants and separate waste in a real‑world scenario. Through comprehensive evaluation encompassing dataset‑based assessments and real‑world experiments, our methodology demonstrated promising potential for robotic waste sorting, yielding performance gains of up to 5.8% in picking contaminants. The project page is available at https://github.com/bach05/wasteGAN.git

Abstract:
We propose an approach for Open‑World Instance Segmentation (OWIS), a task that aims to segment arbitrary unknown objects in images by generalizing from a limited set of annotated object classes during training. Our Segment Object System (SOS) explicitly addresses the generalization ability and the low precision of state‑of‑the‑art systems, which often generate background detections. To this end, we generate high‑quality pseudo annotations based on the foundation model SAM. We thoroughly study various object priors to generate prompts for SAM, explicitly focusing the foundation model on objects. The strongest object priors were obtained by self‑attention maps from self‑supervised Vision Transformers, which we utilize for prompting SAM. Finally, the post‑processed segments from SAM are used as pseudo annotations to train a standard instance segmentation system. Our approach shows strong generalization capabilities on COCO, LVIS, and ADE20k datasets and improves on the precision by up to 81.6% compared to the state‑of‑the‑art. Source code is available at: https://github.com/chwilms/SOS

Abstract:
Addressing Lidar Panoptic Segmentation (LPS) is crucial for safe deployment of autonomous vehicles. LPS aims to recognize and segment lidar points w.r.t. a pre‑defined vocabulary of semantic classes, including thing classes of countable objects (e.g., pedestrians and vehicles) and stuff classes of amorphous regions (e.g., vegetation and road). Importantly, LPS requires segmenting individual thing instances (e.g., every single vehicle). Current LPS methods make an unrealistic assumption that the semantic class vocabulary is fixed in the real open world, but in fact, class ontologies usually evolve over time as robots encounter instances of novel classes that are considered to be unknowns w.r.t. the pre‑defined class vocabulary. To address this unrealistic assumption, we study LPS in the Open World (LiPSOW): we train models on a dataset with a pre‑defined semantic class vocabulary and study their generalization to a larger dataset where novel instances of thing and stuff classes can appear. This experimental setting leads to interesting conclusions. While prior art trains class‑specific instance segmentation methods and obtains state‑of‑the‑art results on known classes, methods based on class‑agnostic bottom‑up grouping perform favorably on classes outside of the initial class vocabulary (i.e., unknown classes). Unfortunately, these methods do not perform on‑par with fully data‑driven methods on known classes. Our work suggests a middle ground: we perform class‑agnostic point clustering and over‑segment the input cloud in a hierarchical fashion, followed by binary point segment classification, akin to a Region Proposal Network [1]. We obtain the final point cloud segmentation by computing a cut in the weighted hierarchical tree of point segments, independently of semantic classification. Remarkably, this unified approach leads to strong performance on both known and unknown classes.

Abstract:
In this work, we study amodal video instance segmentation for automated driving. Previous works perform amodal video instance segmentation relying on methods trained on entirely labeled video data with techniques borrowed from standard video instance segmentation. Such amodally labeled video data is difficult and expensive to obtain and the resulting methods suffer from a trade‑off between instance segmentation and tracking performance. To largely solve this issue, we propose to study the application of foundation models for this task. More precisely, we exploit the extensive knowledge of the Segment Anything Model (SAM), while fine‑tuning it to the amodal instance segmentation task. Given an initial video instance segmentation, we sample points from the visible masks to prompt our amodal SAM. We use a point memory to store those points. If a previously observed instance is not predicted in a following frame, we retrieve its most recent points from the point memory and use a point tracking method to follow those points to the current frame, together with the corresponding last amodal instance mask. This way, while basing our method on an amodal instance segmentation, we nevertheless obtain video‑level amodal instance segmentation results. Our resulting S‑AModal method achieves state‑of‑the‑art results in amodal video instance segmentation while resolving the need for amodal video‑based labels. Code for S‑AModal is available at https://github.com/ifnspaml/S‑AModal.
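
The point-memory mechanism described above can be sketched as a small data structure. This is a minimal illustration under stated assumptions, not the S‑AModal implementation: the class `PointMemory`, the helper `prompts_for_frame`, and the `track_points` callable (standing in for whatever point tracking method is used) are hypothetical names, and the choice to keep only the last observed points in memory is also an assumption.

```python
class PointMemory:
    """Stores, per instance id, the most recent frame index and prompt points."""

    def __init__(self):
        self._store = {}

    def update(self, frame_idx, instance_id, points):
        self._store[instance_id] = (frame_idx, list(points))

    def missing(self, predicted_ids):
        # Instances seen before but absent from the current prediction.
        return sorted(set(self._store) - set(predicted_ids))

    def latest(self, instance_id):
        return self._store[instance_id]


def prompts_for_frame(memory, frame_idx, visible_points, track_points):
    """Build point prompts for one frame.

    visible_points: dict mapping instance id to (x, y) points sampled
        from that instance's visible mask in the current frame.
    track_points: hypothetical callable (points, from_frame, to_frame)
        returning the points propagated to `to_frame`.
    """
    prompts = {}
    # Recover instances that were observed earlier but are not predicted now.
    for instance_id in memory.missing(visible_points.keys()):
        last_frame, points = memory.latest(instance_id)
        prompts[instance_id] = track_points(points, last_frame, frame_idx)
    # Instances predicted in this frame refresh the memory.
    for instance_id, points in visible_points.items():
        memory.update(frame_idx, instance_id, points)
        prompts[instance_id] = list(points)
    return prompts


def shift_tracker(points, from_frame, to_frame):
    # Toy tracker for the demo: everything moves one pixel right per frame.
    return [(x + (to_frame - from_frame), y) for x, y in points]


if __name__ == "__main__":
    memory = PointMemory()
    prompts_for_frame(memory, 0, {1: [(0, 0)], 2: [(5, 5)]}, shift_tracker)
    # Instance 2 is not predicted in frame 1, so its points are tracked forward.
    print(prompts_for_frame(memory, 1, {1: [(1, 0)]}, shift_tracker))
```

The returned prompts would then be passed to the amodal segmentation model, so an occluded instance still receives a prompt even when the underlying video instance segmentation drops it.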

Abstract:
Instance segmentation plays a vital role in the morphological quantification of biomedical entities such as tissues and cells, enabling precise identification and delineation of different structures. Current methods often address the challenges of touching, overlapping or crossing instances through individual modeling, while neglecting the intrinsic interrelation between these conditions. In this work, we propose a Gradient Anomaly‑aware Biomedical Instance Segmentation approach (GAInS), which leverages instance gradient information to perceive local gradient anomaly regions, thus modeling the spatial relationship between instances and refining local region segmentation. Specifically, GAInS is firstly built on a Gradient Anomaly Mapping Module (GAMM), which encodes the radial fields of instances through window sliding to obtain instance gradient anomaly maps. To efficiently refine boundaries and regions with gradient anomaly attention, we propose an Adaptive Local Refinement Module (ALRM) with a gradient anomaly‑aware loss function. Extensive comparisons and ablation experiments in three biomedical scenarios demonstrate that our proposed GAInS outperforms other state‑of‑the‑art (SOTA) instance segmentation methods. The code is available at https://github.com/DeepGAInS/GAInS.

Abstract:
When employing deep neural networks (DNNs) for semantic segmentation in safety‑critical applications like automotive perception or medical imaging, it is important to estimate their performance at runtime, e.g. via uncertainty estimates or prediction quality estimates. Previous works mostly performed uncertainty estimation on pixel‑level. In a line of research, a connected‑component‑wise (segment‑wise) perspective was taken, approaching uncertainty estimation on an object‑level by performing so‑called meta classification and regression to estimate uncertainty and prediction quality, respectively. In those works, each predicted segment is considered individually to estimate its uncertainty or prediction quality. However, the neighboring segments may provide additional hints on whether a given predicted segment is of high quality, which we study in the present work. On the basis of uncertainty indicating metrics on segment‑level, we use graph neural networks (GNNs) to model the relationship of a given segment's quality as a function of the given segment's metrics as well as those of its neighboring segments. We compare different GNN architectures and achieve a notable performance improvement.

Abstract:
Few‑shot Semantic Segmentation addresses the challenge of segmenting objects in query images with only a handful of annotated examples. However, many previous state‑of‑the‑art methods either have to discard intricate local semantic features or suffer from high computational complexity. To address these challenges, we propose a new Few‑shot Semantic Segmentation framework based on the Transformer architecture. Our approach introduces the spatial transformer decoder and the contextual mask generation module to improve the relational understanding between support and query images. Moreover, we introduce a multi‑scale decoder to refine the segmentation mask by incorporating features from different resolutions in a hierarchical manner. Additionally, our approach integrates global features from intermediate encoder stages to improve contextual understanding, while maintaining a lightweight structure to reduce complexity. This balance between performance and efficiency enables our method to achieve competitive results on benchmark datasets such as PASCAL‑5^i and COCO‑20^i in both 1‑shot and 5‑shot settings. Notably, our model with only 1.5 million parameters demonstrates competitive performance while overcoming limitations of existing methodologies. Code: https://github.com/amirrezafateh/MSDNet

Abstract:
Semantic segmentation is an essential step for many vision applications in order to understand a scene and the objects within. Recent progress in hyperspectral imaging technology enables its application in driving scenarios, and the hope is that the devices' perceptive abilities provide an advantage over RGB‑cameras. Even though some datasets exist, there is no standard benchmark available to systematically measure progress on this task and evaluate the benefit of hyperspectral data. In this paper, we work towards closing this gap by providing the HyperSpectral Semantic Segmentation benchmark (HS3‑Bench). It combines annotated hyperspectral images from three driving scenario datasets and provides standardized metrics, implementations, and evaluation protocols. We use the benchmark to derive two strong baseline models that surpass the previous state‑of‑the‑art performances with and without pre‑training on the individual datasets. Further, our results indicate that the existing learning‑based methods benefit more from leveraging additional RGB training data than from leveraging the additional hyperspectral channels. This poses important questions for future research on hyperspectral imaging for semantic segmentation in driving scenarios. Code to run the benchmark and the strong baseline approaches is available at https://github.com/nickstheisen/hyperseg.

Abstract:
Vision‑language models have recently evolved into versatile systems capable of high performance across a range of tasks, such as document understanding, visual question answering, and grounding, often in zero‑shot settings. Comics Understanding, a complex and multifaceted field, stands to greatly benefit from these advances. Comics, as a medium, combine rich visual and textual narratives, challenging AI models with tasks that span image classification, object detection, instance segmentation, and deeper narrative comprehension through sequential panels. However, the unique structure of comics ‑‑ characterized by creative variations in style, reading order, and non‑linear storytelling ‑‑ presents a set of challenges distinct from those in other visual‑language domains. In this survey, we present a comprehensive review of Comics Understanding from both dataset and task perspectives. Our contributions are fivefold: (1) We analyze the structure of the comics medium, detailing its distinctive compositional elements; (2) We survey the widely used datasets and tasks in comics research, emphasizing their role in advancing the field; (3) We introduce the Layer of Comics Understanding (LoCU) framework, a novel taxonomy that redefines vision‑language tasks within comics and lays the foundation for future work; (4) We provide a detailed review and categorization of existing methods following the LoCU framework; (5) Finally, we highlight current research challenges and propose directions for future exploration, particularly in the context of vision‑language models applied to comics. This survey is the first to propose a task‑oriented framework for comics intelligence and aims to guide future research by addressing critical gaps in data availability and task definition. A project associated with this survey is available at https://github.com/emanuelevivoli/awesome‑comics‑understanding.

Abstract:
Early diagnosis and treatment of polyps during colonoscopy are essential for reducing the incidence and mortality of Colorectal Cancer (CRC). However, the variability in polyp characteristics and the presence of artifacts in colonoscopy images and videos pose significant challenges for accurate and efficient polyp detection and segmentation. This paper presents a novel approach to polyp segmentation by integrating the Segment Anything Model 2 (SAM 2) with the YOLOv8 model. Our method leverages YOLOv8's bounding box predictions to autonomously generate input prompts for SAM 2, thereby reducing the need for manual annotations. We conducted exhaustive tests on five benchmark colonoscopy image datasets and two colonoscopy video datasets, demonstrating that our method exceeds state‑of‑the‑art models in both image and video segmentation tasks. Notably, our approach achieves high segmentation accuracy using only bounding box annotations, significantly reducing annotation time and effort. This advancement holds promise for enhancing the efficiency and scalability of polyp detection in clinical settings. Code: https://github.com/sajjad‑sh33/YOLO_SAM2.

Abstract:
The study of social interactions and collective behaviors through multi‑agent video analysis is crucial in biology. While self‑supervised keypoint discovery has emerged as a promising solution to reduce the need for manual keypoint annotations, existing methods often struggle with videos containing multiple interacting agents, especially those of the same species and color. To address this, we introduce B‑KinD‑multi, a novel approach that leverages pre‑trained video segmentation models to guide keypoint discovery in multi‑agent scenarios. This eliminates the need for time‑consuming manual annotations on new experimental settings and organisms. Extensive evaluations demonstrate improved keypoint regression and downstream behavioral classification in videos of flies, mice, and rats. Furthermore, our method generalizes well to other species, including ants, bees, and humans, highlighting its potential for broad applications in automated keypoint annotation for multi‑agent behavior analysis. Code available under: https://danielpkhalil.github.io/B‑KinD‑Multi

Abstract:
Medical image segmentation, a critical application of semantic segmentation in healthcare, has seen significant advancements through specialized computer vision techniques. While deep learning‑based medical image segmentation is essential for assisting in medical diagnosis, the lack of diverse training data causes the long‑tail problem. Moreover, most previous hybrid CNN‑ViT architectures have limited ability to combine various attentions in different layers of the Convolutional Neural Network. To address these issues, we propose a Lagrange Duality Consistency (LDC) Loss, integrated with Boundary‑Aware Contrastive Loss, as the overall training objective for semi‑supervised learning to mitigate the long‑tail problem. Additionally, we introduce CMAformer, a novel network that synergizes the strengths of ResUNet and Transformer. The cross‑attention block in CMAformer effectively integrates spatial attention and channel attention for multi‑scale feature fusion. Overall, our results indicate that CMAformer, combined with the feature fusion framework and the new consistency loss, demonstrates strong complementarity in semi‑supervised learning ensembles. We achieve state‑of‑the‑art results on multiple public medical image datasets. Example code is available at: https://github.com/lzeeorno/Lagrange‑Duality‑and‑CMAformer.

Abstract:
Medical image segmentation, a crucial task in computer vision, facilitates the automated delineation of anatomical structures and pathologies, supporting clinicians in diagnosis, treatment planning, and disease monitoring. Notably, transformers employing shifted window‑based self‑attention have demonstrated exceptional performance. However, their reliance on local window attention limits the fusion of local and global contextual information, crucial for segmenting microtumors and miniature organs. To address this limitation, we propose the Adaptive Semantic Segmentation Network (ASSNet), a transformer architecture that effectively integrates local and global features for precise medical image segmentation. ASSNet comprises a transformer‑based U‑shaped encoder‑decoder network. The encoder utilizes shifted window self‑attention across five resolutions to extract multi‑scale features, which are then propagated to the decoder through skip connections. We introduce an augmented multi‑layer perceptron within the encoder to explicitly model long‑range dependencies during feature extraction. Recognizing the constraints of conventional symmetrical encoder‑decoder designs, we propose an Adaptive Feature Fusion (AFF) decoder to complement our encoder. This decoder incorporates three key components: the Long Range Dependencies (LRD) block, the Multi‑Scale Feature Fusion (MFF) block, and the Adaptive Semantic Center (ASC) block. These components synergistically facilitate the effective fusion of multi‑scale features extracted by the decoder while capturing long‑range dependencies and refining object boundaries. Comprehensive experiments on diverse medical image segmentation tasks, including multi‑organ, liver tumor, and bladder tumor segmentation, demonstrate that ASSNet achieves state‑of‑the‑art results. Code and models are available at: https://github.com/lzeeorno/ASSNet.

Abstract:
Open‑vocabulary image semantic segmentation (OVS) seeks to segment images into semantic regions across an open set of categories. Existing OVS methods commonly depend on foundational vision‑language models and utilize similarity computation to tackle OVS tasks. However, these approaches are predominantly tailored to natural images and struggle with the unique characteristics of remote sensing images, such as rapidly changing orientations and significant scale variations. These challenges complicate OVS tasks in earth vision, requiring specialized approaches. To tackle this dilemma, we propose the first OVS framework specifically designed for remote sensing imagery, drawing inspiration from the distinct remote sensing traits. Particularly, to address the varying orientations, we introduce a rotation‑aggregative similarity computation module that generates orientation‑adaptive similarity maps as initial semantic maps. These maps are subsequently refined at both spatial and categorical levels to produce more accurate semantic maps. Additionally, to manage significant scale changes, we integrate multi‑scale image features into the upsampling process, resulting in the final scale‑aware semantic masks. To advance OVS in earth vision and encourage reproducible research, we establish the first open‑sourced OVS benchmark for remote sensing imagery, including four public remote sensing datasets. Extensive experiments on this benchmark demonstrate our proposed method achieves state‑of‑the‑art performance. All codes and datasets are available at https://github.com/caoql98/OVRS.

Abstract:
The hierarchical architecture has become a mainstream design paradigm for Vision Transformers (ViTs), with Patch Merging serving as the pivotal component that transforms a columnar architecture into a hierarchical one. Drawing inspiration from the brain's ability to integrate global and local information for comprehensive visual understanding, we propose Stepwise Patch Merging (SPM), which enhances the subsequent attention mechanism's ability to 'see' better. SPM consists of Multi‑Scale Aggregation (MSA) and Guided Local Enhancement (GLE), striking a proper balance between long‑range dependency modeling and local feature enhancement. Extensive experiments conducted on benchmark datasets, including ImageNet‑1K, COCO, and ADE20K, demonstrate that SPM significantly improves the performance of various models, particularly in dense prediction tasks such as object detection and semantic segmentation. Meanwhile, experiments show that combining SPM with different backbones can further improve performance. The code has been released at https://github.com/Yonghao‑Yu/StepwisePatchMerging.

Abstract:
Due to their text‑to‑image synthesis feature, diffusion models have recently seen a rise in visual perception tasks, such as depth estimation. The lack of good‑quality datasets makes the extraction of a fine‑grain semantic context challenging for the diffusion models. The semantic context with fewer details further worsens the process of creating effective text embeddings that will be used as input for diffusion models. In this paper, we propose a novel EDADepth, an enhanced data augmentation method to estimate monocular depth without using additional training data. We use Swin2SR, a super‑resolution model, to enhance the quality of input images. We employ the BEiT pre‑trained semantic segmentation model for better extraction of text embeddings. We use BLIP‑2 tokenizer to generate tokens from these text embeddings. The novelty of our approach is the introduction of Swin2SR, the BEiT model, and the BLIP‑2 tokenizer in the diffusion‑based pipeline for the monocular depth estimation. Our model achieves state‑of‑the‑art results (SOTA) on the delta3 metric on NYUv2 and KITTI datasets. It also achieves results comparable to those of the SOTA models in the RMSE and REL metrics. Finally, we also show improvements in the visualization of the estimated depth compared to the SOTA diffusion‑based monocular depth estimation models. Code: https://github.com/edadepthmde/EDADepth_ICMLA.

Abstract:
Data augmentation is crucial for pixel‑wise annotation tasks like semantic segmentation, where labeling requires significant effort and intensive labor. Traditional methods, involving simple transformations such as rotations and flips, create new images but often lack diversity along key semantic dimensions and fail to alter high‑level semantic properties. To address this issue, generative models have emerged as an effective solution for augmenting data by generating synthetic images. Controllable generative models offer data augmentation methods for semantic segmentation tasks by using prompts and visual references from the original image. However, these models face challenges in generating synthetic images that accurately reflect the content and structure of the original image due to difficulties in creating effective prompts and visual references. In this work, we introduce an effective data augmentation pipeline for semantic segmentation using a controllable diffusion model. Our proposed method includes efficient prompt generation using Class‑Prompt Appending and Visual Prior Blending to enhance attention to labeled classes in real images, allowing the pipeline to generate a precise number of augmented images while preserving the structure of segmentation‑labeled classes. In addition, we implement a class balancing algorithm to ensure a balanced training dataset when merging the synthetic and original images. Evaluated on the PASCAL VOC datasets, our pipeline demonstrates its effectiveness in generating high‑quality synthetic images for semantic segmentation. Our code is available at https://github.com/chequanghuy/Enhanced‑Generative‑Data‑Augmentation‑for‑Semantic‑Segmentation‑via‑Stronger‑Guidance.

Abstract:
Despite the promising performance of current video segmentation models on existing benchmarks, these models still struggle with complex scenes. In this paper, we introduce the 6th Large‑scale Video Object Segmentation (LSVOS) challenge in conjunction with the ECCV 2024 workshop. This year's challenge includes two tasks: Video Object Segmentation (VOS) and Referring Video Object Segmentation (RVOS). This year, we replace the classic YouTube‑VOS and YouTube‑RVOS benchmarks with the latest datasets, MOSE, LVOS, and MeViS, to assess VOS under more challenging complex environments. This year's challenge attracted 129 registered teams from more than 20 institutes across over 8 countries. This report includes the challenge and dataset introduction, and the methods used by the top 7 teams in two tracks. More details can be found on our homepage https://lsvos.github.io/.

Abstract:
Spiking Neural Networks (SNNs) have emerged as a promising substitute for Artificial Neural Networks (ANNs) due to their advantages of fast inference and low power consumption. However, the lack of efficient training algorithms has hindered their widespread adoption. Even efficient ANN‑SNN conversion methods necessitate quantized training of ANNs to enhance the effectiveness of the conversion, incurring additional training costs. To address these challenges, we propose an efficient ANN‑SNN conversion framework with only inference scale complexity. The conversion framework includes a local threshold balancing algorithm, which enables efficient calculation of the optimal thresholds and fine‑grained adjustment of the threshold value by channel‑wise scaling. We also introduce an effective delayed evaluation strategy to mitigate the influence of the spike propagation delays. We demonstrate the scalability of our framework in typical computer vision tasks: image classification, semantic segmentation, object detection, and video classification. Our algorithm outperforms existing methods, highlighting its practical applicability and efficiency. Moreover, we have evaluated the energy consumption of the converted SNNs, demonstrating their superior low‑power advantage compared to conventional ANNs. This approach simplifies the deployment of SNNs by leveraging open‑source pre‑trained ANN models, enabling fast, low‑power inference with negligible performance reduction. Code is available at https://github.com/putshua/Inference‑scale‑ANN‑SNN.
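
As background for why thresholds matter in ANN‑SNN conversion, the following sketch simulates a single integrate‑and‑fire neuron with soft reset under a constant input. Its rate‑coded output approximates a ReLU clipped at the firing threshold. This is a generic textbook‑style illustration, not the local threshold balancing algorithm or the delayed evaluation strategy of the paper; the function name and parameters are hypothetical.

```python
def if_neuron_rate(x, threshold, timesteps):
    """Rate-coded output of an integrate-and-fire neuron with soft reset.

    The neuron receives the constant input `x` at every timestep, emits a
    spike whenever its membrane potential reaches `threshold`, and then
    subtracts `threshold` from the potential (soft reset).
    Returns spike_count * threshold / timesteps.
    """
    potential = 0.0
    spikes = 0
    for _ in range(timesteps):
        potential += x
        if potential >= threshold:
            spikes += 1
            potential -= threshold
    return spikes * threshold / timesteps


if __name__ == "__main__":
    for x in (-0.3, 0.25, 0.5, 2.0):
        print(x, if_neuron_rate(x, threshold=1.0, timesteps=8))
```

Inputs above the threshold saturate, and inputs much smaller than the threshold are resolved only coarsely within a few timesteps, which is why the choice of threshold affects conversion accuracy.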

Abstract:
Enabled by large annotated datasets, tracking and segmentation of objects in videos have made remarkable progress in recent years. Despite these advancements, algorithms still struggle under degraded conditions and during fast movements. Event cameras are novel sensors with high temporal resolution and high dynamic range that offer promising advantages to address these challenges. However, annotated data for developing learning‑based mask‑level tracking algorithms with events is not available. To this end, we introduce: (i) a new task termed space‑time instance segmentation, similar to video instance segmentation, whose goal is to segment instances throughout the entire duration of the sensor input (here, the input is quasi‑continuous events and optionally aligned frames); and (ii) MouseSIS, a dataset for the new task, containing aligned grayscale frames and events. It includes annotated ground‑truth labels (pixel‑level instance segmentation masks) of a group of up to seven freely moving and interacting mice. We also provide two reference methods, which show that leveraging event data can consistently improve tracking performance, especially when used in combination with conventional cameras. The results highlight the potential of event‑aided tracking in difficult scenarios. We hope our dataset opens the field of event‑based video instance segmentation and enables the development of robust tracking algorithms for challenging conditions. https://github.com/tub‑rip/MouseSIS

Abstract:
Partially‑supervised multi‑organ medical image segmentation aims to develop a unified semantic segmentation model by utilizing multiple partially‑labeled datasets, with each dataset providing labels for a single class of organs. However, the limited availability of labeled foreground organs and the absence of supervision to distinguish unlabeled foreground organs from the background pose a significant challenge, which leads to a distribution mismatch between labeled and unlabeled pixels. Although existing pseudo‑labeling methods can be employed to learn from both labeled and unlabeled pixels, they are prone to performance degradation in this task, as they rely on the assumption that labeled and unlabeled pixels have the same distribution. In this paper, to address the problem of distribution mismatch, we propose a labeled‑to‑unlabeled distribution alignment (LTUDA) framework that aligns feature distributions and enhances discriminative capability. Specifically, we introduce a cross‑set data augmentation strategy, which performs region‑level mixing between labeled and unlabeled organs to reduce distribution discrepancy and enrich the training set. Besides, we propose a prototype‑based distribution alignment method that implicitly reduces intra‑class variation and increases the separation between the unlabeled foreground and background. This can be achieved by encouraging consistency between the outputs of two prototype classifiers and a linear classifier. Extensive experimental results on the AbdomenCT‑1K dataset and a union of four benchmark datasets (including LiTS, MSD‑Spleen, KiTS, and NIH82) demonstrate that our method outperforms the state‑of‑the‑art partially‑supervised methods by a considerable margin, and even surpasses the fully‑supervised methods. The source code is publicly available at https://github.com/xjiangmed/LTUDA.

Abstract:
Stable Diffusion has demonstrated a strong ability to synthesize images from text descriptions, suggesting that it contains strong semantic cues for grouping objects. Researchers have therefore explored employing Stable Diffusion for training‑free segmentation. Most existing approaches refine the cross‑attention map with the self‑attention map only once, demonstrating that the self‑attention map contains useful semantic information for improving segmentation. To fully utilize the self‑attention map, we present an in‑depth experimental analysis of iteratively refining the cross‑attention map with the self‑attention map, and propose an effective iterative refinement framework for training‑free segmentation, named iSeg. iSeg introduces an entropy‑reduced self‑attention module that uses a gradient descent scheme to reduce the entropy of the self‑attention map, thereby suppressing the weak responses corresponding to irrelevant global information. Leveraging the entropy‑reduced self‑attention module, iSeg stably improves the refined cross‑attention map over successive iterations. Further, we design a category‑enhanced cross‑attention module to generate an accurate cross‑attention map, providing a better initial input for iterative refinement. Extensive experiments across different datasets and diverse segmentation tasks demonstrate the merits of the proposed contributions, leading to promising performance. For unsupervised semantic segmentation on Cityscapes, iSeg achieves an absolute gain of 3.8% in mIoU over the best existing training‑free approach in the literature. Moreover, iSeg supports segmentation with different kinds of images and interactions. The project is available at https://linsun449.github.io/iSeg.
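To make the idea of lowering attention entropy by gradient descent concrete, here is a minimal sketch on a single attention row. It takes one gradient step on the Shannon entropy of a softmax distribution with respect to its logits, using the closed-form gradient dH/dz_i = -p_i (log p_i + H). The abstract does not say which variable iSeg optimizes, so operating on the logits is an assumption, as are the step size and the example values.

```python
import math


def softmax(logits):
    peak = max(logits)                       # subtract max for stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_descent_step(logits, lr=1.0):
    """One gradient-descent step on H(softmax(logits)) w.r.t. the logits."""
    probs = softmax(logits)
    h = entropy(probs)
    # Closed-form gradient: dH/dz_i = -p_i * (log p_i + H)
    grads = [-p * (math.log(p) + h) if p > 0.0 else 0.0 for p in probs]
    return [z - lr * g for z, g in zip(logits, grads)]


row = [2.0, 1.0, 0.5, 0.0]   # hypothetical attention logits for one query
sharpened = entropy_descent_step(row)
print(entropy(softmax(row)), entropy(softmax(sharpened)))
```

The step raises the largest logit and lowers the others, so weak responses receive less probability mass, which is the qualitative effect the abstract describes.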

Abstract:
Adverse conditions like snow, rain, nighttime, and fog pose challenges for autonomous driving perception systems. Existing methods have limited effectiveness in improving essential computer vision tasks, such as semantic segmentation, and often focus on only one specific condition, such as removing rain or translating nighttime images into daytime ones. To address these limitations, we propose a method to improve the visual quality and clarity degraded by such adverse conditions. Our method, AllWeather‑Net, utilizes a novel hierarchical architecture to enhance images across all adverse conditions. This architecture incorporates information at three semantic levels (scene, object, and texture) by discriminating patches at each level. Furthermore, we introduce a Scaled Illumination‑aware Attention Mechanism (SIAM) that guides the learning towards road elements critical for autonomous driving perception. SIAM exhibits robustness, remaining unaffected by changes in weather conditions or environmental scenes. AllWeather‑Net effectively transforms images into normal weather and daytime scenes, demonstrating superior image enhancement results and subsequently enhancing the performance of semantic segmentation, with up to a 5.3% improvement in mIoU in the trained domain. We also show our model's generalization ability by applying it to unseen domains without re‑training, achieving up to 3.9% mIoU improvement. Code can be accessed at: https://github.com/Jumponthemoon/AllWeatherNet.

Abstract:
Visual affordance segmentation identifies the image regions of an object that an agent can interact with. Existing methods re‑use and adapt learning‑based architectures for semantic segmentation to the affordance segmentation task and evaluate on small datasets. However, experimental setups are often not reproducible, leading to unfair and inconsistent comparisons. In this work, we benchmark these methods under a reproducible setup on two single‑object scenarios, tabletop without occlusions and hand‑held containers, to facilitate future comparisons. We include a version of a recent architecture, Mask2Former, re‑trained for affordance segmentation and show that this model is the best‑performing on most testing sets of both scenarios. Our analysis shows that models are not robust to scale variations when object resolutions differ from those in the training set.

Abstract:
Weakly supervised image segmentation (WSSS) from image tags remains challenging due to its under‑constrained nature. Most mainstream work focuses on extracting class activation maps (CAMs) and imposing various additional regularizations. Contrary to the mainstream, we propose to frame WSSS as a problem of reconstruction from a decomposition of the image using its mask, under which most regularizations are embedded implicitly within the framework of the new problem. Our approach has demonstrated promising results in initial experiments and has shown robustness against the problem of background ambiguity. Our code is available at https://github.com/xuanrui‑work/WSSSByRec.

Abstract:
Out‑of‑Distribution (OOD) detection in computer vision is a crucial research area, with related benchmarks playing a vital role in assessing the generalizability of models and their applicability in real‑world scenarios. However, existing OOD benchmarks in the literature suffer from two main limitations: (1) they often overlook semantic shift as a potential challenge, and (2) their scale is limited compared to the large datasets used to train modern models. To address these gaps, we introduce SOOD‑ImageNet, a novel dataset comprising around 1.6M images across 56 classes, designed for common computer vision tasks such as image classification and semantic segmentation under OOD conditions, with a particular focus on the issue of semantic shift. We ensured the necessary scalability and quality by developing an innovative data engine that leverages the capabilities of modern vision‑language models, complemented by accurate human checks. Through extensive training and evaluation of various models on SOOD‑ImageNet, we showcase its potential to significantly advance OOD research in computer vision. The project page is available at https://github.com/bach05/SOODImageNet.git.

Abstract:
Recent advances in large pre‑trained models have led to remarkable progress in instance segmentation on general images. However, industrial scenarios remain challenging. Instance definitions are often application‑specific and inconsistent, and the domain gap from general imagery is substantial due to weak textures and limited contextual cues. Consequently, a direct application of existing models is unreliable. We propose Boundary‑by‑Mask, a few‑shot instance segmentation framework that supervises boundaries instead of interior appearance. Given a few RGB images and corresponding instance masks, the method extracts rich visual features using a foundation‑model encoder and trains a lightweight Signed Distance Function (SDF) head to predict boundary‑aware distance maps. Segmentation masks are obtained through an SDF‑to‑mask reconstruction process. By explicitly estimating contours, the framework achieves reliable instance separation even on low‑texture and color‑uniform surfaces. The instance definition is conditioned by the instance mask. Replacing the mask specifies the segmentation target, such as the whole object or a sub‑part. A pixel‑wise shallow MLP head enables rapid training. Experiments on industrial parts and food items with ambiguous boundaries show strong few‑shot generalization, robustness in feature‑poor conditions, and precise control over mask‑level targets.
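The relationship between an instance mask and a signed distance map can be sketched as follows. The code builds a signed distance map from a small binary mask by brute force and recovers the mask by thresholding at the zero level. The sign convention (positive inside), the Euclidean metric, and the function names are assumptions for illustration; the abstract does not describe the authors' SDF head or reconstruction step at this level of detail.

```python
import math


def signed_distance(mask):
    """Signed distance map of a binary mask (positive inside, by assumption).

    Each pixel stores the Euclidean distance to the nearest pixel with the
    opposite label, negated for background pixels.
    """
    cells = [(r, c, v) for r, row in enumerate(mask) for c, v in enumerate(row)]
    sdf = []
    for r, row in enumerate(mask):
        out = []
        for c, v in enumerate(row):
            d = min((math.hypot(r - r2, c - c2)
                     for r2, c2, v2 in cells if v2 != v), default=math.inf)
            out.append(d if v else -d)
        sdf.append(out)
    return sdf


def sdf_to_mask(sdf, level=0.0):
    """Recover a binary mask by thresholding the signed distance map."""
    return [[1 if d > level else 0 for d in row] for row in sdf]


mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
sdf = signed_distance(mask)
recovered = sdf_to_mask(sdf)
```

Thresholding at zero reproduces the original mask, and raising `level` shrinks the region away from its contour, which shows how a distance map carries explicit boundary information.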

Abstract:
The ocean plays a critical role in sustainable development, particularly in climate change mitigation. Among marine ecosystems, blue carbon ecosystems are recognized as important natural carbon sinks. In this context, this paper addresses precise seaweed classification for blue carbon quantification in Ocean Digital Twin initiatives. Conventional methods, including supervised learning (limited by data scarcity and domain gaps) and self‑supervised learning (unable to assign class labels), struggle with underwater complexities and diverse seaweed species. To overcome this, we propose a novel two‑stage seaweed segmentation technique. This technique first utilizes Supervised and Self‑supervised Learning Model Propagation (SSL.Prop.), which leverages supervised learning for initial class information and approximate locations, guiding self‑supervised learning for detailed, accurate segmentation. Subsequently, MaskFusion (MF) refines these results by merging instance‑level masks for highly accurate segmentation. This integrated approach allows automatic class label assignment and mitigates domain gap effects. Specifically, instance segmentation estimates sparse point locations which then guide self‑supervised learning for detailed region segmentation. Evaluated with underwater images from Yamaguchi Prefecture, our full proposed method (SSL.Prop.+MF) achieved a 0.068 mIoU improvement over USIS‑SAM, demonstrating significant accuracy gains, particularly for small seaweed. This approach demonstrates strong potential for improving blue carbon quantification and marine ecosystem monitoring.

Abstract:
Earth observation imagery plays a critical role in environmental monitoring, urban planning, disaster assessment, and climate analysis. While multi‑spectral sensors are increasingly available, true‑color (RGB) imagery remains widely used due to the power, cost, and deployment constraints of many satellite and aerial platforms. However, existing land‑cover segmentation datasets are often limited in geographic coverage, scale, or public accessibility. To bridge this gap, we introduce BELDE (Building a Large‑scale Earth‑observation Land‑cover Dataset for Europe), a publicly available dataset tailored for RGB‑based remote sensing semantic segmentation. Constructed from Sentinel‑2 true‑color images and ESA WorldCover data annotations, BELDE contains 1,088,385 curated image‑segmentation map pairs spanning Europe with 7 land‑cover classes at 10 m spatial resolution, making it one of the largest publicly available RGB land‑cover segmentation datasets for Earth observation. To facilitate cross‑region generalization studies, we additionally introduce BELDE‑K (16,607 pairs) covering the Republic of Korea and BELDE‑CA‑NV (88,155 pairs) covering California and Nevada in the United States. We establish baseline results using multiple semantic segmentation architectures and evaluate both in‑domain and cross‑domain performance. Models trained on BELDE achieve an F1 score of 83.0% on the European test set, while performance decreases to 66.4% on BELDE‑CA‑NV and 58.3% on BELDE‑K, highlighting the challenges posed by out‑of‑distribution geographic domain shift. By providing a continental‑scale RGB segmentation and evaluation benchmark, BELDE supports the development of robust and transferable Earth observation models. The dataset and benchmark resources will be publicly released.

Abstract:
Recent online video instance segmentation (VIS) methods have achieved impressive results, thus becoming the preferred approach to segment instances in videos. Despite the resurgence of impressive single‑image models, online (or semi‑online) VIS approaches outperform single‑image models (e.g., based on SAM) by using long sequences of densely annotated frames during training. However, such a training setup is expensive in terms of both compute and the dense annotations required. To address these drawbacks, we argue that effective modeling of instances and their evolution in videos does not require densely annotated frames. To that end, we propose a simple and effective module, called Past‑frames Feature Propagation (PFP), which aggregates low‑dimensional features from the image encoder across multiple frames. This simple low‑compute module provides tremendous learning capability in using sparse video frame labels for end‑to‑end training. Combined with light‑weight frame‑specific Instance Queries, our Sparse frame Annotation VIS (SA‑VIS) significantly improves performance over its baseline. Most interestingly, our simple design, which avoids added complexity, effectively bridges the gap in accuracy between training on sparsely and densely annotated video sequences. This translates to a mere 0.4% drop in performance of SA‑VIS when using annotations for only 1/5 of the images in the dataset. Empirically, SA‑VIS shows strong improvements over the baseline on YouTube‑VIS 2019/2021/2022 and Occluded VIS (OVIS), and an improvement of over 1% in AP over the state of the art in a limited‑annotation scenario.

Abstract:
We describe our 4th‑place entry to the ICRA 2026 GOOSE 2D Fine‑Grained Semantic Segmentation Challenge, which reached a composite mean Intersection‑over‑Union (mIoU) of 69.73% on the official 1,815‑image test set. Our model adapts the image encoder of a recent visual foundation model, Segment Anything Model 3 (SAM3), with a lightweight decoder. Beyond this, we contribute two techniques and one empirical finding: (i) a self‑distillation scheme that re‑uses SAM3 itself, prompted with ground‑truth boxes, as a teacher on the classes where it outperforms our own model; (ii) an image‑level multi‑scale test‑time augmentation scheme that restores multi‑scale inference for a fixed‑input‑size model by rescaling the image rather than the model input; and (iii) the finding that an aggressive photometric distortion from a winning 2025 GOOSE 2D entry, transplanted onto our pipeline, is its single largest source of improvement.

Abstract:
Current machine learning models commonly require large and well‑annotated datasets. However, the annotation process often becomes a bottleneck, with increased complexity leading to higher chances of human errors. Within this context, our goal in this paper is to leverage unsupervised algorithms to improve data annotation efficiency for complex semantic segmentation problems in industrial materials science. Previous research has quantified labeling time and others explored unsupervised methods. However, to the best of our knowledge, this is the first study to quantify how much unsupervised algorithms accelerate the labeling process. We aim to validate the extent to which this laborious process can be accelerated, focusing on semantic segmentation tasks that involve annotating each pixel of high‑resolution images, such as the microstructure characterization challenge in materials science. Specifically, we demonstrate that by using unsupervised computer vision algorithms, the time required for the labeling process can be reduced from 170 hours to 37 hours, achieving an approximate reduction of 78%. The dataset we work with includes large images of dimensions 1280x959 and 960x703, which further increases the complexity of the annotation task. Despite these challenges, we create and share the largest public steel microstructure segmentation dataset to date, available under MIT License with permanent DOI, contributing a fully annotated, high‑resolution dataset to the field. Additionally, this is the first work to compare the labeling time from scratch (a common approach in previous studies) to the labeling time when using these unsupervised algorithms as a pre‑annotation step. Furthermore, we provide a Deep Learning model trained on this dataset, validated by field experts, and deployed in an industrial setting, serving as an initial benchmark for this public dataset.
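The reported reduction can be checked directly from the two labeling times given in the abstract. The short calculation below also derives the equivalent speed-up factor, which the abstract does not state; the variable names are illustrative.

```python
hours_from_scratch = 170      # labeling time without pre-annotation
hours_with_assist = 37        # labeling time with unsupervised pre-annotation

reduction = (hours_from_scratch - hours_with_assist) / hours_from_scratch
speedup = hours_from_scratch / hours_with_assist

print(f"reduction: {reduction:.1%}")
print(f"speed-up:  {speedup:.2f}x")
```

The reduction comes to roughly 78%, in line with the abstract, and corresponds to labeling about 4.6 times faster.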

Abstract:
Autonomous robots operating under forest canopies need robust perception of trees and surrounding vegetation across varying seasonal conditions. Existing forestry datasets provide lidar or camera data with per‑tree annotations, but none include co‑registered 4D imaging radar, a modality of growing interest for its resilience to visual degradation, surface contamination, and vegetation occlusion. We introduce a multi‑sensor forest dataset collected by a mobile robot equipped with a high‑resolution FMCW imaging radar, lidar, RGB camera, IMU, and RTK‑GNSS. The site was recorded in two sessions under contrasting vegetation states, and 3D cuboid annotations, including per‑tree diameter estimates, provide shared semantic labels across all three perception modalities. Furthermore, we provide baseline results for semantic segmentation of the radar and lidar point clouds using MinkowskiUNet. Radar achieves IoU scores competitive with lidar for dominant classes (ground 91%, canopy 86%) while lagging on geometrically fine structures such as tree trunks (56% vs. 74%). A cross‑modality analysis further compares lidar and radar trunk segmentation against an RGB detection model, and a diameter‑stratified evaluation reveals how trunk segmentation quality varies with tree size. Beyond segmentation, the co‑registered multi‑modal data and RTK‑GNSS‑aided reference positioning support research in mapping, localization, and sensor fusion under canopy. The dataset and annotation tools are publicly available.

Abstract:
We present an automated approach to distinguish between ply instances in semantic segmentation masks of high‑resolution carbon‑fiber reinforced polymer micrographs. Interpreting the segmentation mask as a graph with pixels as vertices enables us to use a shortest‑path algorithm that yields the ply‑separating paths. Thereby, we bridge the gap between semantic segmentation and ply instance segmentation using global information. We successfully apply our approach to high‑resolution micrographs featuring a broad range of characteristics, such as artificially added gaps in single or multiple plies, different stacking sequences, and ply‑traversing cracks. Assigning each fiber pixel to a ply based on the calculated paths allows for a comprehensive, quantitative ply analysis with respect to microstructural properties such as the local fiber volume fraction as well as locally resolved ply and interleaf layer thickness. These insights help to reveal manufacturing‑induced inhomogeneities, draw conclusions on manufacturing parameters, and link mechanical properties to underlying microstructural imperfections.
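As a rough illustration of the graph view described above, the sketch below runs Dijkstra's algorithm over a small binary mask, treating each pixel as a vertex with 4-connected edges. The pixel costs (100 for fiber, 1 otherwise), the left-to-right path direction, and the name `separating_path` are assumptions made for the example. The abstract does not specify the authors' edge weights or which shortest-path algorithm they use.

```python
import heapq


def separating_path(mask, fiber_cost=100, gap_cost=1):
    """Cheapest 4-connected path from the left column to the right column.

    mask[r][c] is 1 for a fiber pixel and 0 otherwise. Entering a pixel
    costs fiber_cost or gap_cost (hypothetical weights), so the path
    prefers the fiber-free layer between two plies.
    """
    rows, cols = len(mask), len(mask[0])

    def cost(r, c):
        return fiber_cost if mask[r][c] else gap_cost

    dist, prev, heap = {}, {}, []
    for r in range(rows):                      # every left-edge pixel is a source
        dist[(r, 0)] = cost(r, 0)
        heapq.heappush(heap, (dist[(r, 0)], (r, 0)))
    end = None
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if d > dist[(r, c)]:
            continue                           # stale queue entry
        if c == cols - 1:
            end = (r, c)                       # first right-edge pixel popped is optimal
            break
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                nd = d + cost(nr, nc)
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = (r, c)
                    heapq.heappush(heap, (nd, (nr, nc)))
    path = [end]
    while path[-1] in prev:                    # walk back to a source pixel
        path.append(prev[path[-1]])
    return path[::-1], dist[end]


mask = [
    [1, 1, 1, 1, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
    [1, 1, 1, 1, 1],
]
path, total = separating_path(mask)
```

On this toy mask the path follows the fiber-free pixels between the upper and lower fiber regions. Pixels above and below such a path could then be assigned to different plies.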

Abstract:
Intelligent landslide hazard interpretation is critical for disaster prevention, yet current paradigms struggle to simultaneously extract visual features and high‑level geoscientific semantics, while general‑purpose vision‑language models (VLMs) suffer from perceptual limitations and domain hallucinations in complex geological scenarios. To address these challenges, we propose an instruction‑driven agentic framework comprising three components. First, LandslideBench, a multimodal fine‑grained dataset with seven subtype labels, high‑resolution imagery, pixel‑level masks, and high‑quality textual descriptions, is constructed via multi‑VLM cross‑validation and interactive annotation. Then, LandslideVLM, a landslide‑oriented VLM, is fine‑tuned via LoRA on LandslideBench to enhance geological semantic understanding. Finally, LandslideAgent, a domain rule‑enhanced agent taking LandslideVLM as its cognitive backbone, employs a dual‑rule controller incorporating structured report metadata constraints and cross‑validation identification constraints to regulate automated tool invocation. Experiments demonstrate that LandslideBench provides effective baselines across five mainstream models on fine‑grained classification and semantic segmentation. LandslideVLM achieves accuracy improvements of 10.96%, 32.87%, and 15.91% on landslide discrimination, fine‑grained classification, and semantic description quality, respectively. LandslideAgent further enables autonomous multi‑source spatial data inference, realizing full‑process intelligence for landslide identification and analysis.

Abstract:
The GOOSE 2D Fine‑Grained Semantic Segmentation Challenge at the ICRA 2026 Workshop on Field Robotics evaluates dense semantic segmentation of off‑road imagery over a fine‑grained taxonomy of 64 classes and 11 evaluated non‑void coarse categories. We present the first‑place solution to this challenge. Our solution comprises two complementary improvements: (a) a network‑level design that combines a self‑supervised DINOv3 ViT‑L/16 backbone, a ViT‑Adapter, and a Mask2Former mask‑classification decoder, together with a coarse‑category auxiliary loss on the global [CLS] token; and (b) an inference‑time aggregation strategy based on multi‑scale and horizontal‑flip test‑time augmentation and an ensemble of the top three checkpoints selected using Codabench scores. Our method achieves an official composite score of 76.57%, consisting of 69.32% fine‑class mIoU and 83.81% category‑level mIoU, and ranks first on the final phase leaderboard: www.codabench.org/competitions/14257/#/results‑tab.
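The three reported numbers are consistent with the composite score being the arithmetic mean of the fine-class and category-level mIoU. That reading is an assumption, since the abstract does not state the formula. The check below uses exact decimal arithmetic to avoid floating-point rounding artifacts.

```python
from decimal import Decimal, ROUND_HALF_UP

fine_class_miou = Decimal("69.32")
category_miou = Decimal("83.81")

# Assumed definition: composite = mean of the two mIoU values.
composite = (fine_class_miou + category_miou) / 2
rounded = composite.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

print(composite, rounded)
```

The mean is 76.565, which rounds to the reported 76.57%.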

Abstract:
The Vines‑DB dataset contains 1,218 original high‑resolution RGB images of seven ornamental vine species collected under field conditions at the Utah Agricultural Experiment Station's Greenville Research Farm in Logan, Utah, USA. The dataset was generated from 168 individual vine plants that were transplanted in 2022 and photographed repeatedly across multiple months during the 2023 and 2024 growing seasons (July‑October). Images were captured with an iPhone 16 Pro equipped with a 48 MP camera between 10:00 AM and 12:00 PM under daylight. Vines were grown on 1.2m x 2.4m trellises and photographed from a distance of 1m against black or white Styrofoam backdrops to improve contrast and reduce background noise. The dataset includes Akebia quinata, Campsis radicans, Hydrangea anomala petiolaris, Lonicera x heckrottii, Campsis x tagliabuana 'Madame Galen', Parthenocissus quinquefolia, and Wisteria floribunda. All original images were manually annotated in Roboflow by trained annotators to produce polygon‑based instance segmentation masks for eight classes, including seven species and background. After preprocessing and data augmentation, the working dataset was expanded to 2,307 images for model development and evaluation. The augmented dataset was divided into 2,019 training images, 192 validation images, and 96 test images using stratified sampling to maintain balanced representation. Vines‑DB supports the development and evaluation of deep learning models for multi‑class instance segmentation in precision horticulture and urban ecology. The dataset enables applications such as automated canopy cover estimation, species identification, and scalable field phenotyping. In addition, repeated monthly imaging of the plants captures temporal variation in canopy development and plant appearance, increasing the dataset's utility for segmentation benchmarking under realistic field conditions.

Abstract:
Mamba‑based state space models offer linear‑time long‑range modeling for high‑resolution dense prediction, but sequential state‑space propagation can attenuate boundary‑sensitive and detail‑sensitive responses that are critical in multi‑class semantic segmentation. We propose Reload‑Mamba, a semantic segmentation framework that addresses this propagation‑induced response dilution through three segmentation‑specific designs: (i) a boundary‑supervised local detail prior that is explicitly trained with ground‑truth boundary masks to identify regions requiring response restoration; (ii) a class‑uncertainty‑aware Reload Gate that incorporates per‑pixel class entropy from a pre‑reload auxiliary head as an additional gating signal, a formulation that is informative only under multi‑class dense prediction; and (iii) a hierarchical multi‑level Reload mechanism that applies anti‑dilution refinement at three decoder levels and fuses the restored representations top‑down. Built upon a ConvNeXt‑Tiny encoder with a multi‑scale decoder and four‑directional Mamba scanning with pixel‑wise directional attention, Reload‑Mamba achieves 47.9% single‑scale (48.9% multi‑scale) mIoU on ADE20K and 83.2% single‑scale mIoU on Cityscapes. With ResNet‑101 + COCO pre‑training under the standard DeepLab‑style protocol, Reload‑Mamba reaches 87.8% mIoU on PASCAL VOC 2012 val. Controlled ablations show that each of the three segmentation‑specific designs contributes beyond a direct port of the prior anti‑dilution architecture proposed for binarization, cumulatively improving over the direct‑port baseline by +2.2 mIoU on ADE20K.
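The per-pixel class entropy used as a gating signal can be computed as in the following sketch. It shows only the uncertainty measure, normalized to the range 0 to 1. How Reload‑Mamba feeds this signal into its Reload Gate is not specified in the abstract, so nothing here should be read as the authors' gate, and the function names and example logits are hypothetical.

```python
import math


def softmax(logits):
    peak = max(logits)                        # subtract max for stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def class_entropy(logits):
    """Shannon entropy (nats) of one pixel's class distribution."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)


def normalized_uncertainty(logits):
    """Entropy divided by its maximum, log(number of classes)."""
    return class_entropy(logits) / math.log(len(logits))


ambiguous_pixel = [0.0, 0.0, 0.0, 0.0]    # e.g. a pixel on a class boundary
confident_pixel = [10.0, 0.0, 0.0, 0.0]   # e.g. a pixel deep inside a region
print(normalized_uncertainty(ambiguous_pixel))
print(normalized_uncertainty(confident_pixel))
```

Pixels with near-uniform class scores receive values close to 1 and confident pixels values close to 0, so the measure highlights the regions where refinement is most likely to matter.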

Abstract:
Characterising the tumour microenvironment (TME) from routine H&E‑stained histology images requires simultaneous cell segmentation, feature extraction, and interpretable clinical reporting. We present SEGTME‑UNI2, a unified framework addressing these requirements. Its core is UNI2‑UPERHOVER, a dual‑head segmentation model pairing the UNI2‑H pathology foundation model (ViT‑Giant, pretrained on >100M tiles from 100K slides) with two parallel UperNet decoders: one for six‑class semantic segmentation and one for horizontal‑vertical gradient regression enabling watershed‑based nuclear instance separation. To address the lack of pixel‑level annotations in large real‑world repositories, UNI2‑UPERHOVER undergoes a three‑stage progressive pseudo‑label curriculum. Each stage trains a fresh model without weight transfer, driving improvement entirely via increased pseudo‑label quality: Stage 1: Uses human‑annotated PanNuke (7,901 images, 189,744 nuclei, 0.25 um/pixel). Stage 2: Uses entropy‑filtered pseudo‑labels from the Stage 1 model on 271,711 TCGA‑UT scale‑0 patches (0.5 um/pixel). Stage 3: Uses pseudo‑labels from the Stage 2 model on all 1,608,060 TCGA‑UT patches across six resolution scales (0.5‑1.0 um/pixel). Segmentation outputs feed a structured TME feature extraction pipeline computing 20+ per‑patch compositional, morphological, spatial entropy, and intercellular distance metrics. These are encoded as JSON and passed to a fine‑tuned NVIDIA BioNeMo GPT model to generate clinically interpretable TME narratives. Preliminary validation on held‑out PanNuke and TCGA‑UT partitions demonstrates framework feasibility and internal consistency. The pseudo‑labelled TCGA‑UT dataset and UNI2‑UPERHOVER checkpoint are publicly released to support large‑scale TME profiling and spatial biology research.

Abstract:
While video segmentation has advanced rapidly on short clips and closed‑set benchmarks, open‑world video segmentation remains largely unexplored. The challenge is twofold: (1) existing methods are not designed to support object discovery and identity maintenance in long videos with dynamic ego‑motion, and (2) existing evaluation protocols rely on a rigid 1:1 matching that unfairly penalizes semantically valid predictions with mismatched granularity. To address both gaps, we introduce Savvy, a practical and strong system for zero‑shot open‑world long‑horizon video segmentation. Savvy combines hierarchical mask discovery, deferred admission, and track consolidation to support persistent object discovery, safe track promotion, and stable long‑range identity maintenance. We further propose OGA, a granularity‑aware evaluation suite for open‑world video segmentation. Built on a Granularity‑Agnostic (GA) matching protocol, OGA relaxes conventional 1:1 matching to an n:1 mapping, but still enforces temporal rigor by detecting support discontinuities through sever points and scoring each reference object through its dominant coherent fragment. This prevents fragmented or flickering support from being over‑rewarded while enabling GA‑adapted metrics and two structural diagnostics: identity persistence (IP) and identity concentration (IC). On VIPSeg, we show that standard 1:1 evaluation substantially underestimates open‑world methods, whereas GA evaluation recovers much of their suppressed performance. On the more realistic long‑horizon benchmarks ScanNet and HM3D, Savvy consistently outperforms strong baselines across both classical and proposed metrics, including STQ, VPQ∞, IP, and IC. Together, these results establish a practical benchmark and a strong baseline for open‑world long‑horizon video segmentation.

Abstract:
Spaceborne inspection systems often deploy perception models prior to launch, after which updating model weights or expanding fixed label sets becomes operationally impractical. While supervised models can be integrated pre‑flight, adding new semantic capabilities in orbit requires retraining and re‑uploading parameters. We investigate whether prompt‑driven vision‑language models can enable post‑launch semantic expansion, allowing new spacecraft components to be specified via natural‑language prompts without modifying onboard weights. We evaluate zero‑shot instance segmentation of spacecraft components under a strictly frozen, single‑pass inference protocol on a test set of 129 images of previously unseen satellites. Under fixed global thresholds and no post‑processing, SAM3 achieves 0.385 mAP@0.5 and 0.267 mAP@0.5:0.95. Performance is strongly scale‑dependent: large structural elements like spacecraft bodies (0.639 AP@0.5) and solar arrays (0.598 AP@0.5) localize reliably, while relatively small appendages like antennas (0.221 AP@0.5) and thrusters (0.081 AP@0.5) remain difficult. Prompt formulation influences performance, with structured prompts incorporating spatial and geometric descriptors yielding up to 82% improvement over short category‑name prompts. The model operates within the memory and compute envelope of contemporary embedded GPUs, suggesting prompt‑driven grounding can provide a practical mechanism for post‑launch semantic extension of dominant spacecraft structures while highlighting limitations of zero‑shot localization for fine‑scale components under orbital domain shift.

Abstract:
Semantic segmentation is a fundamental component of visual perception in modern automotive systems, enabling pixel‑level scene understanding. Near‑Infrared imaging (NIR) offers stable detection under difficult illumination conditions, but the development of domain‑specific semantic segmentation models remains challenging due to the lack of high‑quality annotated data from real‑world scenarios. Synthetic datasets offer a scalable alternative, but models trained on synthetic images often suffer performance degradation when transferred to real domains. We present the first systematic study on synthetic to real domain adaptation for semantic segmentation in NIR images in the automotive domain. We propose a generative augmentation framework that transforms synthetic images into realistic NIR‑style variants via our introduced target style adaptation (TSA). TSA fine‑tunes a latent diffusion model via low‑rank adaptation on a small curated set of real NIR images and applies it to synthetic training data using structure‑preserving multi‑signal conditioning. To reduce texture bias and improve segmentation robustness, we further apply a Voronoi‑based style diversification strategy (VSD) that modifies the original textures while preserving scene geometry. Experiments with multiple model architectures on NIR data from vehicle interiors and street scenes show that balancing inductive bias during training leads to noticeably more robust semantic segmentation and effectively reduces the domain gap in our real‑world scenarios by up to 63.6% on exterior and 28.4% on interior data. The code is available at GitHub.

Abstract:
As Uncrewed Aerial Vehicles (UAVs) transition toward higher levels of autonomy, the ability to perform unassisted recovery in non‑cooperative, unstructured environments becomes critical. Achieving safe autonomous landing requires high‑fidelity semantic resolution to distinguish navigable terrain from hazardous obstacles, yet development is often hindered by the scarcity of annotated aerial datasets. This work proposes a comprehensive perception and data generation pipeline designed to bridge the sim‑to‑real gap for autonomous landing tasks. We introduce a procedural synthetic data engine that generates photorealistic urban environments with automated semantic annotations through domain randomization. A Transformer‑based OneFormer architecture is fine‑tuned exclusively on this synthetic data, leveraging multi‑head self‑attention mechanisms for global context resolution. To ensure operational safety, a deterministic landing module utilizes a Euclidean Distance Transform (EDT) and dynamic inference logic to identify the largest inscribed safe landing zones while maintaining strict clearance buffers around obstacles. Quantitative benchmarking against the UAVid dataset demonstrates robust semantic segmentation performance, while qualitative validation on real‑world UAV footage confirms the system's ability to identify collision‑free landing sites in unseen environments. Our results highlight the potential of high‑fidelity procedural simulation to eliminate the need for manual annotation while providing robust, edge‑deployable situational awareness for autonomous UAV recovery.
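
The landing-zone step described above (a distance transform over the segmentation output, then the largest inscribed obstacle-free disk with a clearance buffer) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' module: the function name `largest_safe_disk`, the boolean `safe` mask, the `clearance` parameter, and the choice to treat the image border as an obstacle are all hypothetical. The distance transform is brute-forced for readability; a real system would use an optimized routine.

```python
import math


def largest_safe_disk(safe, clearance=0.0):
    """Return (row, col, radius) of the largest obstacle-free disk, or None.

    safe: 2D list of booleans, True = landable pixel (hypothetical mask
    derived from a segmentation output).
    clearance: buffer in pixels subtracted from the free radius (assumption).
    """
    rows, cols = len(safe), len(safe[0])
    obstacles = [(r, c) for r in range(rows) for c in range(cols) if not safe[r][c]]
    best = None
    for r in range(rows):
        for c in range(cols):
            if not safe[r][c]:
                continue
            # Brute-force Euclidean distance to the nearest obstacle pixel.
            d = min(
                (math.hypot(r - orow, c - ocol) for orow, ocol in obstacles),
                default=math.inf,
            )
            # Assumption: the image border counts as an obstacle, so the
            # disk stays inside the observed frame.
            d = min(d, r + 1, c + 1, rows - r, cols - c)
            radius = d - clearance
            if radius > 0 and (best is None or radius > best[2]):
                best = (r, c, radius)
    return best


if __name__ == "__main__":
    open_field = [[True] * 5 for _ in range(5)]
    print(largest_safe_disk(open_field, clearance=1.0))
```

The pixel with the largest distance value is the centre of the largest inscribed disk, which is why a distance transform suits this kind of deterministic zone selection.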

Abstract:
Sociable weaver nests function as complex ecological structures offering thermoregulatory microhabitats and sustaining diverse species; however, datasets used in prior studies lack fine‑grained 3D structural detail. Producing usable and accurate 3D weaver nest data is challenging due to their irregular geometry and integration with complex host vegetation. We bridge this gap with an open‑access, 1.4 TB multimodal drone dataset of 104 nest‑bearing trees, comprising 27,945 RGB images, 111,780 multispectral images, approximately 781 million 3D points, and expert‑annotated semantic segmentation labels. We benchmark semantic segmentation using KPConv, RandLA‑Net, and Point Transformer V3, with PT‑v3 achieving an mIoU of 86.35% on the test set. While the results demonstrate strong performance for transformer‑based and point‑wise methods, they also highlight architecture‑dependent challenges, particularly for convolution‑based approaches such as KPConv. By uniquely combining spectral, spatial, and structural information, the presented dataset advances 3D reconstruction, segmentation, and classification algorithms, enabling ecological applications from nest volume estimation to species conservation, and serves as a demanding benchmark that exposes architecture‑dependent performance under extreme class imbalance.

Abstract:
Simultaneous 3D reconstruction and 6D object pose estimation from a single monocular image is an inherently ill‑posed problem. In industrial settings, however, multiple instances of an object are often randomly arranged in bins, implicitly providing several views of the same object within a single image. We show that this implicit multi‑view geometry can be exploited to simultaneously reconstruct the object in 3D and estimate the 6D pose of each visible object instance. We present MooMIns, a new Gaussian‑splatting‑based approach that inverts the original Gaussian splatting formulation: instead of rendering a single scene from multiple cameras, we render multiple object instances from a single camera. Our method is initialized with SAM3 instance segmentation masks and a modified Structure from Motion (SfM) pipeline. In contrast to learned monocular depth estimation, we perform true geometry‑based reconstruction from image evidence, avoiding hallucinations caused by training data priors. We evaluate MooMIns on synthetic and real bin‑picking scenarios, and demonstrate accurate reconstruction of previously unseen objects as well as reliable pose estimation of individual instances.

Abstract:
Enabling rovers to navigate Mars with ease requires a better understanding of the Martian surface, including the ability to determine the location of mounds. Detecting and studying these morphologies can also help in the search for evidence of extraterrestrial life, more specifically water or environments conducive to life. Until now, mounds have been detected by manually mapping morphological parameters onto Digital Elevation Models. This paper addresses the problem by automatically detecting and/or predicting mounds on Mars using neural‑network‑based semantic segmentation. Two approaches are used: a supervised semantic segmentation model and a generative adversarial approach. A comparison of the approaches shows that adding artificially generated data did not improve the result.

Abstract:
Multi‑object tracking has a heavy‑tailed difficulty distribution: most frames are easy for a lightweight base tracker, while a small fraction are intrinsically hard. Video object segmentation (VOS) models can often preserve identity through the hard frames where the base tracker fails, but they are much more expensive in compute and memory. We propose selective mask propagation, a tracking algorithm that dispatches from a base tracker to a VOS model only on windows where an assignment‑uncertainty signal fires. The base tracker's output is modified only when the VOS model makes a confident prediction that contradicts the base tracker's identity assignment; weak or inconclusive predictions preserve the base output. The method is training‑free, treats both the base tracker and the VOS model as black boxes, and can benefit from replacing the VOS component with a more capable model. On DanceTrack, selective mask propagation improves three different base trackers. On SportsMOT, where identity preservation is central to sports analytics, SAM3‑Deep‑EIoU with global track association achieves state‑of‑the‑art performance on the benchmark with 86.8 HOTA.
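
The dispatch-and-override rule described above can be sketched as follows. This is a hedged illustration only: the function names, the cost-margin uncertainty signal, and the `conf_threshold` value are hypothetical stand-ins, since the abstract specifies neither the actual uncertainty signal nor the confidence criterion. Both models are treated as black boxes, in line with the abstract.

```python
def assignment_is_uncertain(costs, margin=0.1):
    """Hypothetical uncertainty signal: fire when the two cheapest
    track-to-detection assignment costs are closer than `margin`."""
    if len(costs) < 2:
        return False
    best, second = sorted(costs)[:2]
    return (second - best) < margin


def select_identity(base_id, vos_id, vos_conf, conf_threshold=0.8):
    """Keep the base identity unless the VOS model confidently contradicts it."""
    if vos_id is None or vos_conf < conf_threshold:
        return base_id  # weak or inconclusive VOS prediction: keep base output
    return vos_id  # confident prediction (equals base_id when both agree)


def track_window(base_id, costs, run_vos):
    """Call the expensive VOS model only on windows flagged as uncertain."""
    if not assignment_is_uncertain(costs):
        return base_id  # easy window: the VOS model is never invoked
    vos_id, vos_conf = run_vos()
    return select_identity(base_id, vos_id, vos_conf)


if __name__ == "__main__":
    # Ambiguous assignment costs trigger the VOS call, which overrides ID 3.
    print(track_window(3, [0.20, 0.25, 0.90], lambda: (7, 0.95)))
```

Passing the VOS model as a callable keeps its cost off the easy windows, which is the point of dispatching selectively.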

Abstract:
Moving instance segmentation (MIS) attracts increasing attention due to its broad applications in traffic surveillance, autonomous driving, and animal tracking. Event cameras record asynchronous brightness changes, providing high temporal resolution and dynamic range, which makes them highly sensitive to motion information. By fusing event and image features, motion cues from events can complement spatial details from images, enhancing the performance of MIS. However, current multimodal MIS methods still struggle to segment small moving instances, as event cameras often yield sparse features under limited resolution. Moreover, event features entangle appearance attributes with motion cues, which further restricts effective cross‑modal fusion. To address these challenges, we first propose a dual‑disentangling feature extraction framework that separates and extracts appearance and motion information within both image and event modalities, thereby improving feature density. Subsequently, a multi‑granularity cross‑modal alignment is introduced to align distributionally and semantically consistent features across modalities, enabling more effective fusion with rich spatial and temporal details. Experimental results demonstrate that our method achieves state‑of‑the‑art performance in multimodal MIS, especially for small instances under challenging conditions such as fast motion and low‑light settings.

Abstract:
Open‑vocabulary scene sketch semantic segmentation aims to assign dense semantic labels to sparse line drawings based on flexible category vocabularies specified at inference time, without relying on pixel‑level annotations during training. Unlike natural images, sketches lack texture and color cues, making semantic understanding heavily dependent on stroke layout and spatial configuration, a challenge that renders single‑layer vision‑language features inherently unstable. Our key observation is that attention maps from different Vision Transformer layers encode complementary spatial cues: shallow layers capture global structural layouts, while deeper layers focus on local stroke intersections and object parts. This suggests that cross‑layer aggregation provides a more robust structural prior than any individual layer alone. Leveraging this insight, we propose a structure‑aware framework built upon Layer‑wise Accumulated Structural Attention (LASA), which aggregates multi‑layer attention to guide hierarchical semantic alignment under weak supervision and refine predictions during inference. Experiments on FS‑COCO, SFSD, and FrISS show that LASA improves mIoU by +3.43, +8.01, and +15.74 over the prior weakly supervised baselines, demonstrating consistent gains in both segmentation accuracy and spatial coherence. Our source code will be made publicly available.

Abstract:
Point cloud semantic segmentation requires architectures that capture both fine‑grained local geometry and broad global scene structure. Transformer‑based networks have demonstrated strong performance by focusing on detailed local feature aggregation; however, global context is conveyed primarily through skip connections across encoder‑decoder stages, which we argue is insufficient for full scene understanding. We hypothesize that augmenting skip connections with a learnable global feature extraction module allows the network to acquire scene‑level knowledge before descending into local detail, leading to richer and more contextually grounded representations. To this end, we propose Point Transformer with Wavelet Neural Operator (PT‑WNO), which integrates a shared Wavelet Neural Operator (WNO) branch alongside the skip connections of a point cloud transformer backbone. At each encoder‑decoder transition, point features are projected onto a dense 3D volumetric grid where the WNO captures multi‑scale global spectral context through learnable wavelet decomposition and reconstruction. These global features are fused back into the network via lightweight adapters, complementing rather than replacing the existing skip connections. Experiments on four large‑scale 3D point cloud benchmarks demonstrate the effectiveness of PT‑WNO. On S3DIS (Area 5), PT‑WNO achieves 71.59% mIoU, outperforming the Point Transformer v3 (PTv3) baseline by +1.03 points. On DALES it achieves 81.05% mIoU (+1.47 over the baseline). On ScanNet v2, PT‑WNO obtains 76.19% mIoU, remaining competitive with the baseline (76.36%).

Abstract:
In this paper, we present a novel set of related models for semantic segmentation of node‑link diagrams. These diagrams are frequently used to represent mathematical graphs, relationships between concepts, and flowcharts. Such diagrams are difficult to access non‑visually; while some assistive interfaces have been designed for node‑link diagrams, they rely upon a machine‑readable representation of the diagram, whereas such diagrams will generally be made available as bitmap images. Our compact deep learning models show excellent quantitative and qualitative performance on a large synthetic dataset of node‑link diagrams, reaching per‑pixel accuracy over 93%.

Abstract:
Remote‑sensing and UAV applications need models that generalize across platforms and viewpoints without task‑specific training. Yet training‑free pipelines often falter on oriented geometry, scale/rotation variation, and crowded ports or airfields, and rarely unify detection and segmentation. We introduce ZODS‑RS, a training‑free, closed‑form pipeline that outputs horizontal boxes (HBB) and instance masks. Built on DINOv3 dense features and SAM‑style proposals, ZODS‑RS chains: PP (prototype purification via Tyler covariance), R‑SEM (rotation‑scale equivariant matching with separable kernels and global Hungarian assignment), and UAM (uncertainty‑aware pixelwise merging with adaptive priors and optional negative prototypes). A lightweight CWLA fuses multiple DINOv3 layers. On FAIR1M (HBB) we obtain mAP@0.50:0.95 = 13.06 and AP_S = 2.93 (class‑averaged over ship/airplane); on xView (HBB) we report mAP = 16.69. On our UAV dataset, ZODS‑RS achieves mask mIoU = 31.10 and improves small‑object AP by +30.70 over Grounded‑SAM on a single 5090. This work offers a unified, no‑training solution for horizontal‑box detection plus instance segmentation in aerial imagery; provides explicit closed‑form formulations for PP/R‑SEM/UAM tightly coupled with DINOv3; and demonstrates consistent gains on small and crowded targets and under cross‑domain shifts while keeping deployment simple.

Abstract:
Semantic segmentation in remote sensing requires costly pixel‑level annotations, and nearly every problem demands a new dataset since models rarely transfer across sensors, platforms, or geographies. Existing human‑in‑the‑loop frameworks expand sparse clicks into dense supervision via auxiliary machinery (pseudo‑labels, propagation, CRFs, foundation‑model prompts, auxiliary heads), all operating on the model's predictive distribution. A confidently wrong pixel is indistinguishable from a confidently correct one in that distribution by construction, so no rule reading it can separate the two; the distinguishing signal is external to the model. This paper hypothesizes that expert clicks targeting confident model errors, not arbitrary pixels, suffice to match dense supervision, with no expansion machinery. iSAGE (Iterative Sparse Annotation Guided by Expert) realizes this hypothesis on an integrated open‑source platform, where an error‑weighted loss amplifies the gradient at each click and the annotation record itself is the dataset, extensible, correctable, and auditable. Experiments use a minimum‑effort regime: at most one labeled pixel per class per frame. On BsB Aerial, iSAGE recovers 97.2% of dense supervision (74.79% mIoU on 0.040% of pixels) with contrasting class dynamics: amorphous classes (permeable areas) saturate from the seed, while small classes (cars) require late‑iteration effort. On ISPRS Vaihingen (external benchmark), iSAGE reaches 76.78% mIoU with 0.011% of pixels, matching the dense baseline (76.65%) and exceeding all published methods. Under the same pipeline, four output‑reading mechanisms (oracle entropy across budgets 1‑100x, pseudo‑labels across thresholds 0.90‑0.99, CRF‑based propagation, uniform random) plateau 7.4 to 14.5 pp below iSAGE. Across 31 surveyed methods, iSAGE is the only iterative human‑in‑the‑loop framework operating without auxiliary machinery.

Abstract:
Reliable evaluation of instance segmentation models requires metrics that accurately and consistently reflect segmentation quality. However, the metrics most widely used in biological imaging carry fundamental mathematical weaknesses: hard Intersection‑over‑Union (IoU) thresholds that produce discontinuous, low‑sensitivity scoring; per‑object normalization that distorts scores under object size variation; and greedy or one‑to‑many matching procedures that yield non‑optimal, order‑dependent correspondences. Together, these properties produce unintuitive and unreliable model rankings under common failure modes such as split cells, merged cells, and cell boundary imprecision. We propose Maximum Matching Accuracy (MMA), a threshold‑free continuous metric that finds a globally optimal one‑to‑one matching between predicted and ground truth objects and aggregates total overlap using per‑pixel normalization. We evaluate MMA against AP@50, PQ, SEG, and AJI across three experiments: synthetic failure cases, progressive corruption tests, and a model ranking comparison. MMA produces scores that are more stable, more sensitive, and more interpretable than existing alternatives, providing a principled foundation for fair instance segmentation benchmarking in biological cell imaging.
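
To make the idea of a globally optimal one-to-one matching with per-pixel normalization concrete, here is a small sketch. It is not the paper's definition of MMA, which the abstract does not give. As an assumption for illustration, the score is the maximum total intersection over all one-to-one matchings, divided by the number of pixels that are foreground in either the prediction or the ground truth. The function name `matching_accuracy_sketch` is hypothetical, and the matching is brute-forced over permutations, which is only practical for a handful of objects; an assignment solver such as the Hungarian algorithm would replace it at scale.

```python
from itertools import permutations


def matching_accuracy_sketch(gt, pred):
    """Score in [0, 1] from an optimal one-to-one matching (illustrative only).

    gt, pred: lists of objects, each a set of (row, col) pixel coordinates.
    Assumed normalization: matched overlap / pixels in the union of all objects.
    """
    union = set().union(*gt, *pred)
    if not union:
        return 1.0  # convention assumed here: nothing to segment, nothing predicted
    inter = [[len(g & p) for p in pred] for g in gt]
    n_gt, n_pred = len(gt), len(pred)
    best = 0
    if n_gt <= n_pred:
        # Assign each ground-truth object a distinct prediction.
        for perm in permutations(range(n_pred), n_gt):
            best = max(best, sum(inter[i][perm[i]] for i in range(n_gt)))
    else:
        # Assign each prediction a distinct ground-truth object.
        for perm in permutations(range(n_gt), n_pred):
            best = max(best, sum(inter[perm[j]][j] for j in range(n_pred)))
    return best / len(union)


if __name__ == "__main__":
    cell = {(0, 0), (0, 1), (0, 2), (0, 3)}
    halves = [{(0, 0), (0, 1)}, {(0, 2), (0, 3)}]
    print(matching_accuracy_sketch([cell], halves))  # split cell is penalized
```

Because the matching is optimized globally, the score does not depend on the order in which objects are listed, and a split cell earns partial credit in proportion to the pixels of its best-matching fragment rather than an all-or-nothing threshold decision.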

Abstract:
Change detection and scene recognition techniques have been widely applied to Street View Imagery (SVI) to understand changes in scenes across the years. However, metadata alone is often insufficient to reliably find visually aligned image pairs. This study introduces the PairWise image finder, a tool that integrates feature detection and matching, supported by semantic segmentation masks to quantify the visual alignment of two images of varying time periods. The tool outputs the share of matched key features, the matched feature distance and coverage, and the alignment of semantic masks, which enables the user to filter image pairs depending on the alignment quality and use case. The visually aligned pairs derived from the tool can be used to accurately study explicit longitudinal change and help reduce manual effort for perception studies. The usability of the tool is demonstrated through a comparison of longitudinal changes, highlighting the importance of perspective when quantifying changes. The proposed method provides a scalable and open tool for researchers and stakeholders to find high‑quality image pairs for urban analysis, perception and related applications.

Abstract:
To perform a wide range of daily tasks, robots need to construct a 3D representation that is semantically rich, physically grounded, and structured enough to support task planning and affordance prediction. However, existing approaches primarily focus on semantic retrieval, often overlooking physical and kinematic factors. Methods that attempt to model physical properties typically rely on narrow training sets or single‑object modeling, limiting scalability and generalization across diverse object types. To address these challenges, we present PhysGraph, a framework that unifies symbolic reasoning with structured 3D geometry to model kinematic and physical properties in cluttered scenes. Given RGB‑D observations, PhysGraph reconstructs object‑centric 3D geometry and associates object instances across views. It then decomposes objects into functional parts and infers materials and articulations through visual reasoning. Evaluated on both synthetic and real‑world datasets, PhysGraph achieves state‑of‑the‑art results in semantic segmentation, multi‑object mass estimation, and articulation prediction. With its simple yet effective design, PhysGraph produces physically consistent and semantically structured scene graphs, serving as a structured 3D representation for downstream tasks such as constraint‑aware 3D affordance prediction and real‑to‑sim transfer, both of which are demonstrated in our experiments.

Abstract:
We present SegmentAnyTreeV2, a sensor‑ and platform‑agnostic framework for semantic and instance segmentation of forest point clouds. The model combines a serialization‑based Point Transformer v3 backbone with a lightweight semantic head and a tree‑focused cross‑attention mask decoder. Semantic predictions restrict instance decoding to tree‑class voxels, while instance‑aware query initialization, one‑to‑many seed supervision, and asymmetric mask scoring improve separation in dense and structurally complex stands. We further introduce FOR‑instance v3, an expanded benchmark comprising 427 scenes and 26,496 annotated trees across diverse biomes, forest structures, and LiDAR platforms. On the FOR‑instanceV2 test split, SegmentAnyTreeV2 achieves 90.5% precision, 80.2% recall, 85.0% F1, 90.7% coverage, and 87.6% semantic mIoU, outperforming previous learning‑based methods in both instance detection and mask completeness. Zero‑shot evaluation on independent sites further demonstrates strong cross‑domain generalization.

Abstract:
Accurate 3D instance segmentation in point cloud data is critical for machine vision applications. Recent advancements leverage multiple pre‑trained foundation models to generate 3D proposals, followed by the application of proposal aggregation methods, which significantly enhance performance. However, they often produce sub‑optimal results due to inherent variations in confidence levels across different segmentation models, resulting in a bias toward the model with higher confidence. This bias is inherently model‑dependent and is influenced by factors such as data preprocessing techniques and training strategies. To address this bias, we propose a novel, training‑free 3D instance segmentation approach via Geometric Visual Correspondence (GVC‑Seg), which exploits the correspondence between 3D geometric cues and 2D visual cues to mitigate the confidence bias. Additionally, a 3D proposal generation module and a mask‑aware CLIP feature extraction module are introduced during the instance mask generation and instance semantic reasoning, respectively. In this way, GVC‑Seg enhances proposal quality assessment, ensuring unbiased ensemble learning across different models. Extensive experiments demonstrate that our method achieves state‑of‑the‑art performance on several challenging benchmarks, while also exhibiting strong potential in open‑vocabulary semantic segmentation settings.

Abstract:
Underwater instance segmentation integrates pixel‑level mask prediction and instance‑level discrimination for marine resource exploration, ecological monitoring, and underwater robotic perception. Recent prompt‑based and auxiliary‑modality methods improve mask quality, but their reliance on large foundation models, prompt generation, or extra modality estimation complicates efficient deployment. This work introduces Lightweight Underwater Salient Instance Segmentation Detection Transformer (LUSIS‑DETR), a compact detection‑transformer framework built around the Aqua Boundary‑Saliency Attention Module (AquaBSAM). AquaBSAM embeds underwater boundary, contrast, attenuation, chroma, dark‑channel, and center‑prior cues into DINOv2‑initialized multi‑scale features through bounded residual modulation, while auxiliary mask supervision and small‑object copy‑paste are training‑only. Extensive evaluation on four recent underwater instance segmentation datasets, UIIS, UIIS10K, USIS10K, and USIS16K, shows competitively leading performance against previous state‑of‑the‑art works across category‑aware and salient‑instance protocols. TensorRT half‑precision (FP16) benchmarking on an NVIDIA T4 graphics processing unit (GPU) achieves 4.31‑6.34 milliseconds (ms) latency, supporting real‑time inference under an accessible reproduction setting.

Abstract:
Semantic image segmentation, which assigns a predefined category label to each pixel, has achieved significant progress lately. Open‑Vocabulary Segmentation (OVS) extends the segmentation task from a fixed set to an open set, enabling the identification and segmentation of novel concepts based on arbitrary text inputs, such as category names or descriptions. In this paper, we propose a novel Semantic Calibration Network (SCN) for open‑vocabulary semantic segmentation. Different from prior approaches that focus on feature aggregation or simple fine‑tuning of pre‑trained models, SCN refines the mask classification process by explicitly modeling the semantic correlations between classes, aiming to enhance the model's discriminative power while effectively preserving the generalization abilities of the pre‑trained CLIP model. Specifically, SCN comprises two core components: Class Disambiguation (CD) and Logits Fusion (LF). First, a cross‑attention mechanism is utilized to transform the text embeddings into visually aware pseudo‑text embeddings, in order to derive an enhanced similarity score that complements the original mask‑text similarity score. Subsequently, the Class Disambiguation module captures implicit inter‑class dependencies through a residual architecture to effectively resolve semantic ambiguities. Finally, the Logits Fusion module dynamically integrates multifaceted semantic evidence to ensure that the model achieves a robust semantic consensus while maintaining CLIP's inherent generalization capability. Comprehensive experimental results on mainstream benchmarks demonstrate that the proposed method achieves significant performance improvements compared to state‑of‑the‑art algorithms.

Abstract:
In Video Instance Segmentation (VIS), classification, segmentation, and tracking objectives are jointly evaluated, but their individual contributions to performance loss remain opaque. We introduce a diagnostic framework that formulates identity and class assignment as an Integer Linear Program (ILP), yielding a model‑agnostic oracle that hierarchically isolates each error source. Applied to seven VIS methods spanning online and offline paradigms across YouTube‑VIS 2019/2021 and a diagnostic subset of OVIS, our analysis reveals a consistent picture. Tracking instability is a critical bottleneck for online methods, with gaps exceeding 20 AP under heavy occlusion, and grows sharply with video length and instance density. While semantic classification contributes meaningfully on standard benchmarks, its impact becomes negligible where tracking fails most. Although stronger backbones substantially lift default scores, they leave AP tracking gaps largely intact, confirming that temporal fragility is algorithmic rather than purely representational. To complement the oracle, we introduce TrackLens, a visual tool that translates gap magnitude into observable, query‑level failure modes. Together, these tools provide a systematic foundation for targeting VIS's core challenge: robust long‑term temporal association.

Abstract:
Novel class discovery in point cloud segmentation aims to transfer knowledge from known classes to automatically identify and segment unlabeled novel classes in point clouds. Existing methods mainly rely on pairwise associations for class assignment and novel class reasoning, which limits their ability to capture complex relationships among known and novel classes and may lead to inaccurate semantic segmentation. To address this issue, we introduce a hypergraph‑based framework that models high‑order associations among classes and enables collaborative reasoning from known classes to novel classes beyond traditional pairwise relations. Moreover, existing methods tend to focus on semantic feature extraction while paying insufficient attention to geometric information in point clouds. To better exploit spatial structure, we propose Geometric‑Aware Prototypes to enhance the representation of class‑level geometric cues. By propagating geometric information through hyperedges, the proposed method improves the understanding of spatial distributions across classes and leads to more accurate segmentation. Experiments on the SemanticKITTI and SemanticPOSS datasets demonstrate the effectiveness and superiority of our method.

Abstract:
Video panoptic segmentation (VPS) aims to jointly detect, segment, and track all objects while partitioning the video into semantically consistent regions. We introduce the task setting of unsupervised VPS, omitting any human supervision. Existing work on unsupervised scene understanding has mainly focused on image segmentation tasks; the video domain remains underexplored. We propose VideoCUPS, the first unsupervised VPS approach. VideoCUPS generates temporally consistent panoptic video pseudo‑labels from scene‑centric videos by exploiting unsupervised depth, motion, and visual cues. Training on these pseudo‑labels using a novel Video DropLoss yields an accurate, unsupervised VPS model. To benchmark progress, we introduce a comprehensive evaluation protocol and four competitive baselines, extending state‑of‑the‑art unsupervised panoptic image and instance video segmentation models to VPS. VideoCUPS outperforms all baselines and demonstrates strong label‑efficient learning. With VideoCUPS, our evaluation protocol, and baselines, we provide a strong foundation for future research on unsupervised VPS.

Abstract:
Semantic segmentation in medical imaging is a critical yet challenging task due to data scarcity and high variability across modalities. While foundation models like the Segment Anything Model (SAM) show promise, they often struggle with medical images without specific adaptation. Moreover, point prompts, despite being the most natural form of user interaction, provide insufficient spatial context for reliable segmentation, particularly when target structures are irregular or poorly contrasted. In this paper, we propose an enhanced segmentation framework that integrates a lightweight Box Predictor module into the MedSAM architecture. The Box Predictor estimates an approximate bounding box from a single user click using localized image embedding features, providing spatial guidance that reduces the ambiguity of point prompts, while introducing only 1.6M additional parameters and negligible inference overhead. We introduce a two‑stage training pipeline where the Box Predictor is trained independently before being integrated into MedSAM. To validate the generalization capability of our method, we conduct extensive evaluations on four diverse datasets (FLARE22, BRISC, BUSI, LungSegDB) spanning distinct imaging modalities, including CT, MRI, and Ultrasound. Our method improves segmentation accuracy and robustness across varied anatomical structures and imaging domains, achieving Dice scores of 0.89 (BUSI), 0.93 (FLARE22), 0.88 (BRISC), and 0.98 (LungSegDB). Code is available at https://github.com/Amirhosseinmovahedi/MedSAM‑BoxPredictor

Abstract:
Weakly Incremental Learning for Semantic Segmentation (WILSS) suffers from the continuous introduction of noisy supervision, which progressively corrupts class‑level representations, leading to severe feature drift and semantic corruption, thereby causing newly learned classes to overwrite old ones. To address these issues, we propose a drift‑resilient WILSS approach, named SASA, designed to stabilize semantic learning via Semantic Anchors and Spatial Arbitration. Specifically, at the representation level, we introduce semantic anchors of learnable tokens as rigid class‑level references to preserve long‑term semantic identity. Complementary to this, an elastic residual adaptation facilitates controlled, instance‑specific refinement, ensuring a stable yet flexible learning trajectory. At the supervision level, we develop a Spatial Label Arbitration mechanism that performs geometry‑aware decisions to directly filter unreliable signals and enforce a strict "one object, one class" constraint. By synergistically stabilizing representations and improving supervision reliability, SASA effectively mitigates feature drift under weak supervision. Extensive experiments on standard benchmarks demonstrate that our approach consistently outperforms existing state‑of‑the‑art methods, particularly in challenging multi‑step incremental settings. The code is available at https://github.com/ZhonggaiWang/SASA.

Abstract:
Autonomous systems require robust Multi‑Object Tracking and Segmentation (MOTS) to operate reliably in dynamic environments, ensuring consistent object identities and precise mask‑level delineation. Foundation models such as SAM2 have shown strong zero‑shot generalization for segmentation, but their direct application to MOTS is limited by unreliable track association and false‑positive propagation. This work introduces Seg2Track++, a framework that integrates instance segmentation with SAM2 and a novel track management module to perform zero‑shot MOTS with enhanced temporal consistency. Tracks are associated using Mask Centroid Distance (MCD) and Confidence‑Aware Cost Modulation (CCM), while Probabilistic Track Validation (PTV) employs a Bernoulli filter to validate track existence and suppress ghost tracks. Experimental results on KITTI MOTS demonstrate improved identity preservation, reduced false‑positive propagation, and robust track management without fine‑tuning.
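
The abstract does not define Mask Centroid Distance precisely. One plausible reading, sketched below under that assumption, is the Euclidean distance between the centroids of a track's last mask and a candidate detection mask. The helper names `mask_centroid` and `centroid_distance` are hypothetical, and the confidence modulation (CCM) and Bernoulli validation (PTV) steps are not modeled here.

```python
import math


def mask_centroid(mask):
    """Centroid (row, col) of the foreground pixels of a binary mask, or None if empty."""
    coords = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not coords:
        return None
    n = len(coords)
    return (sum(r for r, _ in coords) / n, sum(c for _, c in coords) / n)


def centroid_distance(mask_a, mask_b):
    """Euclidean distance between mask centroids; infinite if either mask is empty."""
    ca, cb = mask_centroid(mask_a), mask_centroid(mask_b)
    if ca is None or cb is None:
        return math.inf
    return math.dist(ca, cb)


# Toy masks (made up): the object moves down by one row between frames.
track_mask = [[1, 1], [0, 0]]
detection_mask = [[0, 0], [1, 1]]
print(centroid_distance(track_mask, detection_mask))
```

A cost matrix of such distances between all tracks and detections could then feed any standard assignment step.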

Abstract:
Real‑time vision demands models that are accurate, efficient, and simple to deploy across diverse hardware. The YOLO family has become widely deployed for this reason, yet most YOLO detectors still rely on non‑maximum suppression at inference, carry heavy detection heads due to Distribution Focal Loss, require long training schedules, and can leave the smallest objects without positive label assignments. We present Ultralytics YOLO26, a unified real‑time vision model family that addresses these limitations through coordinated architecture and training advances. YOLO26 uses a dual‑head design for native NMS‑free end‑to‑end inference and removes DFL entirely, yielding a lighter head with unconstrained regression range. Its training pipeline combines MuSGD, a hybrid Muon‑SGD optimizer adapted from large language model training; Progressive Loss, which shifts supervision toward the inference‑time head; and STAL, a label assignment strategy that guarantees positive coverage for small objects. Beyond detection, YOLO26 introduces task‑specific head and loss designs for instance segmentation, pose estimation, and oriented detection, producing consistent gains across tasks and scales. The family spans five scales (n/s/m/l/x) and supports detection, instance segmentation, pose estimation, classification, and oriented detection in a single pipeline, with an open‑vocabulary extension, YOLOE‑26, for text‑, visual‑, and prompt‑free inference. Across all scales, YOLO26 achieves 40.9‑57.5 mAP on COCO at 1.7‑11.8 ms T4 TensorRT latency, advancing the accuracy‑latency Pareto front over prior real‑time detectors, while YOLOE‑26x reaches 40.6 AP on LVIS minival under text prompting. Code and models are available at https://github.com/ultralytics/ultralytics.

Abstract:
Unstructured scenes present unique challenges for autonomous driving, as irregular obstacles and sparse scene layouts undermine the effectiveness of traditional perception methods such as 3D object detection. 3D semantic occupancy prediction has emerged as a prominent focus due to its ability to provide dense spatial representations by assigning semantic labels to individual voxels in 3D space. However, directly applying 3D semantic occupancy prediction to unstructured scenes remains challenging because scene sparsity hinders effective cross‑modal fusion and the more severe long‑tail distribution in these scenarios further degrades prediction performance. To validate the effectiveness of our approach, we construct a dedicated dataset of unstructured scenes collected from open‑pit mines. Based on this, we propose UnsOcc, a multi‑modal 3D semantic occupancy prediction framework that improves robustness in unstructured environments. At its core, we introduce a rendering‑based fusion module, RenderFusion, which enhances cross‑modal feature alignment through bidirectional rendering supervision. Furthermore, we propose GSRefinement, a detail‑aware auxiliary supervision method based on Gaussian Splatting that projects sparse 3D occupancy predictions into dense 2D semantic segmentation maps, enabling effective supervision for long‑tail categories. Extensive experiments on both the open‑pit mine dataset and the nuScenes dataset demonstrate that our method significantly outperforms existing state‑of‑the‑art approaches.

Abstract:
We present a novel compact deep multi‑task learning model that handles various autonomous driving perception tasks in one forward pass. The model simultaneously performs semantic segmentation in multiple views, depth estimation, light detection and ranging (LiDAR) segmentation, and bird's eye view projection without support from other models. We also provide an adaptive loss weighting algorithm to tackle the imbalanced learning issue that arises from the large number of tasks. Through data pre‑processing and intermediate sensor fusion techniques, the model can process and combine multiple input modalities retrieved from RGB cameras, dynamic vision sensors (DVS), and LiDAR placed at several positions on the ego vehicle. Therefore, a better understanding of a dynamically changing environment can be achieved. In the ablation study, the model variant trained with our proposed method achieves better performance. Furthermore, a comparative study is conducted to clarify its performance and effectiveness against a combination of recent models. As a result, our model maintains better performance even with far fewer parameters. Hence, the model can run inference faster with less GPU memory utilization. Moreover, the results tend to be consistent across 3 different CARLA simulation datasets and 1 real‑world nuScenes‑lidarseg dataset. To support future research, we share code and other files publicly at https://github.com/oskarnatan/compact‑perception.

Abstract:
Audio‑visual speaker tracking aims to localize and track active speakers by leveraging auditory and visual cues, enabling fine‑grained, human‑centric scene understanding. This capability is essential for real‑world applications such as intelligent video editing, surveillance, and human‑computer interaction. However, existing datasets are largely limited to simple or homogeneous audio‑visual scenes with coarse annotations. Such oversimplified settings bias evaluation toward static audio‑visual co‑occurrence, rather than rigorously assessing robust spatiotemporal modeling and cross‑modal reasoning in complex, dynamic scenes. To address these limitations, we introduce AVTrack, a human‑centric audio‑visual instance segmentation (AVIS) dataset designed for dynamic real‑world scenarios. AVTrack features diverse and challenging conditions, including camera motion, visual occlusions, and position changes. Evaluations of representative AVIS methods on AVTrack reveal substantial performance degradation, establishing AVTrack as a challenging benchmark for robust human‑centric audio‑visual scene understanding in complex environments. We further provide a simple yet effective baseline to facilitate future research. Project website: https://FudanCVL.github.io/AVTrack/

Abstract:
We propose an online monocular perception‑to‑control framework that embeds semantic risk into the distance field used by Control Barrier Function (CBF)‑based safe navigation and teleoperation. Many perception‑based safety filters assign the same distance‑based safety margin to all mapped obstacles or use semantics only as a downstream controller adjustment, rather than encoding semantic risk in the spatial representation. Our framework instead reasons online about obstacle geometry and class‑dependent risk by embedding semantic information directly into the Euclidean Signed Distance Field (ESDF). This design encodes semantic risk before control optimization, so high‑risk objects exert a larger spatial influence in the safety field while retaining efficient ESDF queries at runtime. Specifically, a foundation‑model‑based SLAM front end reconstructs dense 3‑D geometry from monocular RGB video, while per‑frame semantic segmentation provides pixel‑level class labels that are fused into the reconstructed geometry. The resulting geometric‑semantic representation is then converted into an ESDF, where semantic labels identify safety‑relevant regions and impose class‑dependent inflation before field computation. The semantic‑aware ESDF provides the local distance values and spatial derivatives required by the CBF controller, while class‑dependent gains further regulate the controller response. Extensive simulation and hardware experiments demonstrate online operation at 10‑20 Hz and semantic‑aware safe behavior in both teleoperation and autonomous navigation.
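
To illustrate the idea of class‑dependent inflation, the sketch below treats each obstacle point as a disc whose radius depends on its semantic class and returns the distance from a query point to the nearest inflated obstacle (negative inside an inflated region). This is a brute‑force illustration under stated assumptions, not the paper's ESDF pipeline: the class names, the radii in `INFLATION_M`, and the function `inflated_distance` are all hypothetical.

```python
import math

# Hypothetical class-dependent inflation radii in metres (not from the paper).
INFLATION_M = {"person": 1.0, "chair": 0.25}


def inflated_distance(query, obstacles, inflation=INFLATION_M):
    """Distance from `query` to the nearest obstacle after class-dependent inflation.

    obstacles: list of ((x, y), class_name); unknown classes get zero inflation.
    The value is negative when the query lies inside an inflated obstacle.
    """
    return min(
        math.dist(query, position) - inflation.get(class_name, 0.0)
        for position, class_name in obstacles
    )


# Toy scene (made up): the person is farther away than the chair,
# but its larger inflation radius makes it the binding constraint.
obstacles = [((2.0, 0.0), "person"), ((0.0, 1.5), "chair")]
print(inflated_distance((0.0, 0.0), obstacles))
```

In this toy scene the chair is geometrically closer (1.5 m versus 2.0 m), yet the inflated distance is set by the person (2.0 - 1.0 = 1.0 versus 1.5 - 0.25 = 1.25), which is the sense in which a higher‑risk class exerts a larger spatial influence.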

Abstract:
Evaluating the performance of visual perception systems for autonomous driving is essential to ensure reliable operation across diverse environmental scenarios. Ideally, a balanced and fair analysis across different adverse conditions would require perfectly paired images of the same scene under different weather or illumination changes. This would allow evaluating the effect of photometric shifts independently of geometry and semantic changes. Unfortunately, real‑world datasets rarely provide images of the same scene under different environmental conditions, because, normally, camera pose, traffic, and locations of dynamic objects (vehicles, pedestrians, etc.) vary over time, thus yielding only coarsely paired data. To address this challenge, this work introduces a data generation framework based on a high‑fidelity game engine for extracting perfectly paired images. By leveraging software APIs that communicate with the GTA game engine, the framework modifies illumination and weather conditions while preserving scene geometry, camera pose, and the identity and placement of dynamic objects. For each sampled location, it procedurally instantiates dynamic entities and renders pixel‑aligned images under diverse adverse conditions. The benefit of the proposed generation framework in driving scenarios is demonstrated through a systematic analysis of semantic segmentation models, whose output degradation can be attributed more directly to photometric shifts rather than to uncontrolled semantic or geometric factors.

Abstract:
Motion blur from high‑speed UAV acquisition degrades semantic segmentation on rare texture‑dependent classes with high agronomic value. Standard CNNs rely on high‑frequency magnitude features that blur destroys, causing statistical erasure of minority signals. We propose Dual Quantile Activation (QAct), a rank‑aware block replacing magnitude gating with instance‑level rank normalization. Evaluated on Agriculture‑Vision 2021 across zero‑shot and blur‑supervised regimes at multiple severities, QAct is the dominant architectural factor: it delivers consistent mIoU gains over ReLU across both regimes and all severities, with strongest gains on rare structural and texture‑dependent classes. Some dominant classes (water, planter skip) show mixed per‑class performance under distillation. At moderate blur, zero‑shot QAct outperforms distillation‑trained ReLU; across all severities, Distill‑QAct achieves best performance, confirming rank‑aware activation and blur‑domain training are complementary robustness sources.

Abstract:
Large‑scale vision foundation models have driven substantial gains on dense prediction tasks such as semantic segmentation, but their size makes deployment impractical in resource‑constrained settings, motivating knowledge distillation as a means of transferring their capabilities to lightweight student networks. However, modern foundation teachers are predominantly transformer‑based models that encode global context, whereas efficient students are typically convolutional networks with locally biased receptive fields. Existing distillation methods largely assume architectural homogeneity and rely on direct feature mimicry, which fails to bridge this representational gap and neglects the structured spatial dependencies and discriminative organization required for accurate semantic segmentation. In this paper, we propose SWARD, a knowledge distillation framework that addresses this gap through two complementary mechanisms. First, we introduce a Multi‑Scale Windowed Attention Distillation (MWAD) module that aligns teacher‑student attention‑based relations within stochastically shifted window partitions whose offsets are randomly resampled at every training iteration. This removes window boundary bias, and, combined with the multi‑scale design, captures both short‑ and long‑range spatial dependencies. Second, we introduce Prototype Discriminative Regularization (PDR), a loss that helps shape the student's feature distribution by enforcing inter‑class separation and intra‑class compactness, further sharpening the discriminative structure beyond what feature mimicry alone can produce under the student's reduced capacity. Experiments across different vision applications (i.e., urban scene parsing and medical image segmentation) show that SWARD achieves state‑of‑the‑art performance.

Abstract:
This work presents a configurable pipeline for generating semantic‑segmentation‑ready agricultural datasets from Sentinel‑2 imagery and EuroCrops parcel‑level annotations. The workflow transforms heterogeneous vector crop annotations into aligned multispectral image‑mask pairs through label harmonization, Sentinel‑2 product selection, spatial alignment, rasterization, patch extraction, quality filtering, and class‑aware sample selection. The generated dataset contains 67,337 patches from five European countries and uses a reduced taxonomy of ten crop classes plus background. A four‑level U‑Net with Group Normalization was trained using 10 Sentinel‑2 spectral bands and a composite loss combining class‑weighted cross‑entropy and Dice loss. On the internal EuroCrops‑based test split, the model achieved a mean Intersection over Union (mIoU) of 0.7665, a pixel accuracy of 0.8693, and a mean class accuracy of 0.9072. Compared with spectral and spatial‑context Random Forest baselines, the U‑Net showed the importance of learned multi‑scale spatial representations for crop segmentation. External evaluation was performed on unseen Belgian EuroCrops subsets, DACIA5, and PASTIS. The results show a clear performance gap under external and cross‑dataset evaluation, especially for benchmarks with different taxonomies, annotation protocols, spatial coverage, or temporal organization. The model transfers more reliably to dominant and taxonomically aligned classes such as maize and wheat, while performance remains limited for several minority classes and for the adapted single‑date PASTIS setting. These findings highlight both the potential and the limitations of using EuroCrops‑derived supervision for Sentinel‑2 crop segmentation under realistic domain shifts.

Abstract:
In this work, we present our solution for the 8th UG2+ Challenge (CVPR 2026) Track 3: Dynamic Object Segmentation in Turbulence (DOST). Our method is built upon the strong baseline framework Segment Any Motion (SegAnyMo), which provides powerful mask generation and motion tracking capabilities. To further boost the segmentation performance under severe atmospheric distortions, we propose two key improvements. First, we employ a data‑centric domain adaptation strategy. We significantly expand our training data by incorporating selected sequences from the DAVIS dataset alongside a subset of the DOST dataset, and apply simulated atmospheric fluctuation degradations to enhance the model's robustness against complex geometric distortions. Second, we introduce a spatio‑temporal post‑processing module. This refinement step effectively removes persistent boundary‑connected false foregrounds and short‑lived fragmented noise, while strictly preserving genuine small targets and maintaining original individual labels across frames. With these combined strategies, our proposed method ranks 2nd in the challenge.

Abstract:
Police‑reported crash statistics remain the standard input for urban road‑safety assessment, but their incompleteness and reporting lag limit their usefulness for timely, fine‑grained intervention design. Harsh acceleration and braking events are widely used as surrogate safety indicators, but have so far been studied only in comparatively small urban samples. This study analyses harsh events across the urban road network of Milan, combining high‑resolution telematics from more than 4.2 million vehicles equipped with On‑Board Units, segment‑level traffic metrics from TomTom, street‑network and infrastructure attributes from OpenStreetMap, and visual streetscape features extracted from Google Street View via semantic segmentation using a OneFormer model. We employ an analytical framework combining non‑parametric Mann‑Whitney U tests of segment‑feature distributions between high‑ and low‑harshness groups with supervised machine‑learning regressors. We find that, once exposure is controlled for, wider carriageways, crossings and transit stops, and more open visual fields (higher sky‑ and road‑pixel proportions) are associated with higher harsh‑event intensity, while denser built frontage is associated with lower intensity. Finally, the cycling‑infrastructure case study identifies a gradient in harsh‑event intensity across facility types: markings‑only cycle lanes are associated with a 19.5% higher harshness score, and mixed‑traffic configurations with an 11.5% higher score, relative to physically separated cycle paths, conditional on the included controls. These results support context‑specific rather than uniform urban‑safety interventions and illustrate how large‑scale telematics combined with open geospatial and visual data can inform Vision Zero decision‑making at the metropolitan scale.
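
For readers unfamiliar with the test used to compare feature distributions between the two harshness groups, the following minimal sketch computes the Mann‑Whitney U statistic by direct pair counting. The sample values are invented purely for illustration and do not come from the study; in practice a library routine such as `scipy.stats.mannwhitneyu` would also supply the p‑value.

```python
def mann_whitney_u(x, y):
    """U statistic for sample x: count of (xi, yj) pairs with xi > yj, ties counting 0.5."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u


# Invented feature values for two groups of road segments (illustration only).
high_group = [12.0, 15.5, 14.0]
low_group = [9.0, 11.0, 12.0, 10.5]

u_high = mann_whitney_u(high_group, low_group)
u_low = mann_whitney_u(low_group, high_group)
print(u_high, u_low)
```

The two statistics always sum to the number of pairs, here 3 * 4 = 12, so a value of `u_high` close to 12 indicates that the first group tends to take larger values.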

Abstract:
LiDAR semantic segmentation is a core perception capability for autonomous vehicles and mobile robots. However, safe operation also depends on knowing when predictions are unreliable. Existing approaches typically rely on softmax confidence, which is often miscalibrated and overconfident, while stronger uncertainty estimates from Monte Carlo dropout or ensembles are often computationally expensive for real‑time use. To this end, we introduce a novel, architecture‑agnostic uncertainty‑aware Adapter Head. It decomposes the prediction into a Preference Head for class ranking and a Strength Head that refines uncertainty assessment, thereby enabling a principled construction of evidential Dirichlet representations. Building on this design, we propose our inverse‑vacuity self‑calibration objective (Invascal), which directly supervises the strength signal to produce reliable and well‑calibrated uncertainty estimates while preventing runaway evidence growth. We evaluate our framework across multiple LiDAR datasets and backbone architectures. We compare against deterministic training, Monte Carlo dropout and ensembles, and prior evidential methods. Our approach consistently improves uncertainty calibration over traditional deterministic methods with minimal computational overhead. At the same time, it preserves competitive segmentation accuracy, where prior evidential methods often suffer performance degradation.
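
The abstract does not spell out how the two heads are combined into a Dirichlet distribution. One simple construction, shown here purely as an assumption, scales a softmax preference distribution by a non‑negative strength and adds a uniform prior of 1 per class; the vacuity K / sum(alpha) is the standard uncertainty mass in evidential models. The function names below are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dirichlet_from_heads(preference_logits, strength):
    """Assumed construction: alpha_k = 1 + strength * softmax(preference)_k."""
    probs = softmax(preference_logits)
    return [1.0 + strength * p for p in probs]


def vacuity(alpha):
    """Uncertainty mass K / sum(alpha); equals 1.0 when there is no evidence."""
    return len(alpha) / sum(alpha)


alpha = dirichlet_from_heads([2.0, 0.0, 0.0], strength=9.0)
print(vacuity(alpha))                                           # three classes, total evidence 9
print(vacuity(dirichlet_from_heads([2.0, 0.0, 0.0], strength=0.0)))  # no evidence
```

Under this construction the class ranking depends only on the preference logits, while the strength alone controls the vacuity, K / (K + strength).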

Abstract:
The morphological analysis of mitochondria in fluorescence microscopy (FM) is crucial for understanding cellular health, energy production, and metabolic regulation. While foundation models like the Segment Anything Model (SAM) have revolutionized natural image segmentation, their direct application to FM is hindered by a significant domain shift characterized by diffraction‑limited resolution, low contrast, and complex overlapping organelle networks. Furthermore, the development of robust models is bottlenecked by a severe lack of high‑quality, manually annotated instance segmentation datasets for mitochondria. In this paper, we propose a scalable solution to this data scarcity by fine‑tuning SAM exclusively on synthetically generated FM data. We simulate realistic mitochondria data and emulate the optical properties of fluorescence microscopes to create a large‑scale annotated dataset. We evaluate our fine‑tuned model on a curated dataset of real, manually annotated FM images. Qualitative and quantitative analyses demonstrate that our synthetically fine‑tuned model improves precision and average Dice score over strong baselines. This work establishes the potential of simulation‑assisted training for FM instance segmentation.

Abstract:
Plain Transformers have become the de‑facto architecture for processing text, audio, image, and video, offering a unified backbone for multimodal learning. However, state‑of‑the‑art architectures for point cloud semantic segmentation remain dominated by U‑Net architectures where convolutions are interleaved with local or windowed attentions. In this work, we show how to effectively leverage vanilla, non‑hierarchical ViTs for segmentation of large‑scale automotive lidar scenes. We bridge the performance gap thanks to a carefully designed tokenizer, a lightweight decoder segmentation head, and tailored data augmentations. Our approach, VaViT for Vanilla ViT, matches or exceeds the performance of state‑of‑the‑art methods while maintaining the simplicity of the ViT architecture. We provide extensive evaluations on nuScenes, SemanticKITTI, and Waymo Open Dataset to validate the efficiency of our method. Code and models are available at https://github.com/valeoai/VaViT.

Abstract:
In this work we introduce a novel approach to domain incremental learning, adapting models over time to evolving, non‑stationary data. In contrast to other works, we do not attempt to avoid catastrophic forgetting, but rather allow it and exploit it. Our model combines a main task head with a self‑supervised masked autoencoder (MAE) head. We then learn domain‑specific LoRA adapters during incremental training. Each adapter specializes to its domain, naturally inducing forgetting on other domains in both heads. At inference, we perform online test‑time training on the self‑supervised MAE head to identify which LoRA best matches the current input, so the model can 'remember' the domain again. Our scheme is especially well‑suited to real‑world streaming data, such as video, where consecutive samples are highly correlated and domain shifts are gradual. We demonstrate our method on domain‑incremental action recognition and semantic segmentation tasks.

Abstract:
The Panoptic Quality (PQ) metric is the standard for jointly evaluating instance and semantic segmentation. However, its original definition relies on a One‑to‑One matching between predicted and ground truth segments, which is only straightforward when the IoU threshold exceeds 0.5. Below 0.5, multiple matching strategies emerge in a poorly explored problem space. We systematically elucidate this space by recasting segment matching as a constrained bipartite assignment problem. Independently bounding the prediction‑ and ground‑truth‑side degrees yields four matching strategies: One‑to‑One, Many‑to‑One, One‑to‑Many, and Many‑to‑Many. We show that the first three are well‑defined within the PQ framework, while Many‑to‑Many falls outside it. These strategies become relevant when instances are fragmented, adjacent objects are difficult to delineate, or annotations are noisy. Central to our framework is a vertex‑based accounting of TP, FN, and FP, anchored to ground truth and predicted segments rather than to matching edges. We further show that the framework extends naturally to part‑aware panoptic segmentation, and we explore part‑aware evaluation on biomedical data. Across configurable case studies we report how different combinations of thresholds and matching strategies behave in practice. We release a unified open‑source package built on Panoptica. It exposes Voronoi‑based region‑wise analysis, part‑aware evaluation, and Area Under Threshold Curve computations as configurable options.
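
As a reference point for the discussion above, the sketch below computes the original PQ with One‑to‑One matching for a single class, i.e. the sum of matched IoUs divided by |TP| + 0.5|FP| + 0.5|FN|. It is a minimal illustration, not the released package: segments are represented as sets of pixel ids, segments within each list are assumed not to overlap, and only the unambiguous regime with threshold at least 0.5 is handled. The function names and toy segments are assumptions.

```python
def iou(a, b):
    """Intersection over union of two segments given as sets of pixel ids."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def panoptic_quality(preds, gts, threshold=0.5):
    """PQ for one class with One-to-One matching (IoU strictly above threshold)."""
    if threshold < 0.5:
        # Below 0.5 a segment may have several candidate matches; not covered here.
        raise ValueError("this sketch only covers threshold >= 0.5")
    matched_preds = set()
    iou_sum = 0.0
    for gt in gts:
        for idx, pred in enumerate(preds):
            if idx in matched_preds:
                continue
            score = iou(pred, gt)
            if score > threshold:
                matched_preds.add(idx)
                iou_sum += score
                break
    tp = len(matched_preds)
    fp = len(preds) - tp   # unmatched predictions
    fn = len(gts) - tp     # unmatched ground-truth segments
    denom = tp + 0.5 * fp + 0.5 * fn
    return iou_sum / denom if denom else 0.0


# Toy example (made up): one good match, one missed object, one spurious prediction.
gts = [{0, 1, 2, 3}, {10, 11}]
preds = [{1, 2, 3, 4}, {20, 21}]
print(panoptic_quality(preds, gts))
```

In the toy example the single match has IoU 3/5, and with one false positive and one false negative the denominator is 1 + 0.5 + 0.5 = 2.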

Abstract:
Recent crack segmentation methods often follow generic semantic segmentation designs, using stronger backbones, hybrid CNN‑Transformer‑Mamba encoders, and auxiliary enhancement branches. Although effective, this raises the question of whether stronger generic feature mixing is the most suitable direction for crack segmentation. We instead formulate crack segmentation as sparse structural recovery. Cracks have limited category‑level semantics but strong morphological regularities, being thin, sparse, anisotropic, locally fragmented, and easily confused with textures or shadows. Thus, the key bottleneck lies in preserving weak structural evidence, recovering directional continuity, and suppressing background coupling. We propose RIFT, a compact family of morphology‑aligned crack segmentation models. Rather than compressing a complex generic architecture, RIFT is simple by design, preserving local evidence, aggregating cooperative directional continuity, and restoring crack structures through lightweight multi‑scale fusion. Experiments on four public benchmarks show that RIFT achieves the best or tied‑best results across the 16 main metrics against reproduced representative baselines. RIFT‑B gives the strongest overall accuracy, while RIFT‑T provides the best deployment efficiency with only 0.47M parameters and high inference speed. Topology‑aware evaluation, ablations, transfer experiments, and visualizations further verify that task‑aligned simplicity can match or surpass complex hybrid architectures when its inductive bias fits crack morphology. Code: https://github.com/xauat‑liushipeng/RIFT

Abstract:
Knowledge graphs over corpora of inter‑referencing documents ‑ scholarly papers, legal opinions, policy briefs ‑ encode the topology of reference but not its stance. The standard representation collapses a rich evaluative relation into an untyped edge, losing the very content that supports community‑level queries about how one document is received by another. We propose the claim network: a representational pattern in which each cross‑document reference is reified as a typed claim, carrying source, target, claim text, and a four‑class stance label grounded in the citation‑intent literature. We give a construction pipeline applicable to any corpus of scholarly inter‑referencing documents and instantiate it on a corpus of 127 papers in 3D point cloud semantic segmentation, producing a network of 8,260 typed claims. Three downstream task families demonstrate what the network enables: retrieval signal augmentation, aggregated‑stance summarisation, and topological analytics. Head‑to‑head evaluation against standard Retrieval‑Augmented Generation (RAG) baselines shows that the gain over flat retrieval is the gain from the right intermediate representation rather than the wrong one.

Abstract:
Modern vision models process images in a single feed‑forward pass, which limits their ability to recover missing evidence or refine uncertain representations under incomplete observations. Inspired by the iterative nature of human perception, we introduce PRISM (Progressive Reasoning through Iterative Slot Memory), a pyramid vision architecture that reasons over images through iterative refinement. At a high level, PRISM groups visual features into object‑centric representations, retrieves relevant patterns from a learned memory, and iteratively refines the representation to resolve ambiguity and recover missing information. This organize‑recall‑refine process operates recurrently across multiple scales, enabling progressive improvement of visual representations. Across standard vision tasks, including image classification, object detection, and semantic segmentation, PRISM achieves competitive performance while demonstrating improved robustness under incomplete observations such as occlusion. These results suggest that iterative reasoning with structured representations and memory is a promising direction for building more resilient and adaptive vision models. Source code and models will be released.

Abstract:
Moving Object Segmentation (MOS) aims to discover, segment, and track objects that move independently of the camera. Current MOS methods, however, exhibit two fundamental limitations: they rely on pre‑computed 2D auxiliary modalities such as optical flow or point trajectories that lack 3D geometric information, and they treat motion as a sequence‑level attribute, overlooking the instantaneous motion state of each object. We address both by grounding MOS in 3D space and time, and propose GMOS, a framework that operates directly on RGB video to produce 3D‑aware, temporally fine‑grained segmentation of multiple moving objects, alongside a foreground‑background variant GMOS‑S for faster deployment. To support training and evaluation in this regime, we curate GMOS‑2K, a dataset of 2,210 real‑world videos with per‑object temporal motion annotations drawn from five established Video Object Segmentation (VOS) benchmarks, and formalise MOS‑I ("I" for instantaneous), a temporally fine‑grained evaluation protocol with three complementary metrics. GMOS achieves state‑of‑the‑art results across MOS, MOS‑I, and Unsupervised VOS benchmarks, while running significantly faster than prior multi‑object MOS methods and supporting online inference for streaming deployment.

Abstract:
Integrating open‑vocabulary semantic information into dynamic 3D scene representations is essential for long‑term embodied scene understanding. However, existing methods often suffer from fragile instance association due to incomplete cross‑view cues, while their limited ability to handle object‑level topological changes restricts long‑term robotic task execution. Moreover, current 3D scene understanding methods either rely on simple feature matching without explicit spatial reasoning or assume offline ground‑truth 3D geometry. To address these challenges, we present DGSG‑Mind, a hybrid instance‑aware 3D Gaussian dynamic scene graph system with an embodied reasoning agent. Our system couples a probabilistic voxel grid with explicit 3D Gaussians to enable robust cross‑modal instance fusion and incremental semantic mapping. It handles dynamic changes through Gaussian‑based visual relocalization and localized masked refinement guided by geometric‑semantic consistency. Built on the instance Gaussian map, DGSG‑Mind further constructs a hierarchical scene graph and develops the 3D Gaussian Mind, which integrates structural relations, spatial‑semantic information, and visually annotated RoI Gaussian renderings for multimodal reasoning. Extensive experiments show that DGSG‑Mind achieves the best zero‑shot 3DVG performance among methods operating on self‑reconstructed maps, while also delivering strong performance in 3D open‑vocabulary semantic segmentation and scene reconstruction. We further deploy DGSG‑Mind on real‑world robots to demonstrate its target‑oriented reasoning and dynamic update capabilities. The project page of DGSG‑Mind is available at https://icr‑lab.github.io/DGSG‑Mind

Abstract:
Self‑supervised learning (SSL) has produced a diverse landscape of vision transformers (ViTs) whose pretrained representations support a wide range of downstream tasks. Towards a better understanding of these models, a body of work has assessed the mechanics of their self‑attention as well as the types of information captured across their representations, revealing, for example, stark differences between models trained with contrastive learning (CL) and masked image modeling (MIM). However, these advances in model understanding have not yet fully permeated the broader community, where insights specific to CL models are sometimes generalized to MIM models. To make model understanding straightforward and intuitive for a broad audience, we propose a simple and easily interpretable visualization protocol. Our protocol is based on visualizing unsupervised semantic segmentation results, yet our goal is not to maximize segmentation performance. Instead, it allows us to convey model behaviors that consistently emerge across images. Benchmarking a diverse set of SSL models across layers and representations, we obtain novel insights into distinct positional biases and scaling behaviors, including strong boundary artifacts in DINOv3‑Large model tokens. These insights complement and help communicate a range of previous findings. Our protocol further enables a clear visual distinction between positional effects and the closely related but distinct locality bias, which has been studied much more extensively in the literature. The protocol is publicly available on GitHub and we believe it will catalyze further model understanding for a broad community.

Abstract:
Semantic segmentation is crucial for autonomous navigation in off‑road environments, enabling precise classification of surroundings to identify traversable regions. However, factors inherent to off‑road conditions, such as source‑target domain discrepancies and sensor corruption from rough terrain, can cause distribution shifts that move the data away from the training conditions. This often leads to inaccurate semantic label predictions and subsequent failures in navigation tasks. To address this, we propose ST‑Seg, a novel framework that expands the source distribution through style expansion (SE) and texture regularization (TR). Unlike prior methods that implicitly apply generalization within a fixed source distribution, ST‑Seg offers an intuitive approach to handling distribution shift. Specifically, SE broadens domain coverage by generating diverse realistic styles, augmenting the limited style information of the source domain. TR stabilizes local texture representation affected by style‑augmented learning through a deep texture manifold. Experiments across various distribution‑shifted target domains demonstrate the effectiveness of ST‑Seg, with substantial improvements over existing methods. These results highlight the robustness of ST‑Seg, enhancing the real‑world applicability of semantic segmentation for off‑road navigation.

Abstract:
Vision‑based approaches have become the dominant paradigm for traversability estimation in unstructured outdoor environments, typically adapting vision foundation models (VFMs) via semantic segmentation supervision. However, this paradigm faces three fundamental challenges that undermine its reliability: the task‑agnostic design of VFMs, the ambiguity of traversability annotations, and the discrepancy between semantic labels and physical safety. We propose Vision‑to‑Traversability Adaptation (ViTA), a framework that adapts VFMs for reliable traversability estimation, instantiated on SAM2. ViTA injects task‑specific knowledge through learnable traversability prompts while preserving the VFM's cross‑domain generalization. To handle annotation ambiguity, we introduce Perspective‑Diversified Training, which estimates semantic uncertainty to suppress confident predictions at ambiguous boundaries. To bridge the semantic‑traversability discrepancy, we distill geometric knowledge during training, enabling slope and elevation reasoning from RGB images alone at inference. The semantic and geometric outputs are fused into a continuous traversability score that reflects both semantic uncertainty and geometric risk. Evaluations across diverse domains, including challenging real‑world off‑road datasets, demonstrate that ViTA achieves state‑of‑the‑art IoU and Precision with substantial false‑positive reduction and strong cross‑domain generalization.

Abstract:
Online 3D scene perception in real time is essential for robotics, AR/VR, and autonomous systems, particularly in edge computing scenarios where computational resources are limited and privacy is crucial. Recent state‑of‑the‑art methods like EmbodiedSAM (ESAM) demonstrate the promise of online 3D perception by leveraging the Segment Anything Model (SAM) for real‑time, fine‑grained, and generalized 3D instance segmentation. However, ESAM still relies on a computationally expensive 3D sparse UNet for point cloud feature extraction, which accounts for the majority of the 3D inference time, hindering its practicality on resource‑constrained devices. In this paper, we propose ESAM++, a lightweight and scalable alternative for online 3D scene perception tailored to edge devices without GPU acceleration. Our method introduces a 3D Sparse Feature Pyramid Network (SFPN) that efficiently captures multi‑scale geometric features from streaming 3D point clouds while significantly reducing computational overhead and model size. We evaluate our approach on four challenging segmentation benchmarks, namely ScanNet, ScanNet200, SceneNN, and 3RScan, demonstrating that our model achieves competitive accuracy with up to 3 times faster inference with a 2 times smaller model size compared to ESAM, enabling practical deployment on edge devices.

Abstract:
This technical report presents our solution for the CVPR 2026 UG2+ Challenge Track 3: Dynamic Object Segmentation in Turbulence (DOST). We design a training‑free multi‑signal segmentation pipeline that combines pretrained motion estimation, self‑supervised semantic priors, background anomaly modeling, manually calibrated proposal fusion, and SAM2‑based mask refinement. The method uses RAFT for dense motion responses, DINOv2 for semantic objectness priors, ViBe for training‑free background modeling, and pretrained SAM2 for box‑prompt mask refinement. Instead of optimizing an end‑to‑end segmentation network, our system operates entirely in inference mode. This design is suitable for the DOST setting, where severe atmospheric turbulence produces pseudo‑motion, blur, and intermittent target visibility, making a single motion cue unreliable. The final submitted masks are evaluated by the official leaderboard, which reports 0.425041 mIoU and 0.457206 mDice. Since no task‑specific model training or fine‑tuning is performed, stronger learned temporal association, adaptive proposal selection, or task‑specific adaptation may further improve the system.
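
For reference, the two leaderboard metrics can be illustrated on a single pair of binary masks. The sketch below is a minimal illustration, not the challenge's official evaluation code: the function name `iou_and_dice`, the flattened 0/1 list representation, and the empty-mask convention are assumptions made here, and the leaderboard's averaging protocol is not described in the abstract.

```python
from typing import Sequence, Tuple


def iou_and_dice(pred: Sequence[int], target: Sequence[int]) -> Tuple[float, float]:
    """Return (IoU, Dice) for two flattened binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection
    if union == 0:
        # Both masks are empty: treated as perfect agreement (assumed convention).
        return 1.0, 1.0
    iou = intersection / union
    dice = 2 * intersection / (pred_area + target_area)
    return iou, dice


pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 1, 1, 0]
iou, dice = iou_and_dice(pred, target)
print(f"IoU={iou:.3f}, Dice={dice:.3f}")  # IoU=0.500, Dice=0.667
```

For a single mask pair, Dice equals 2·IoU / (1 + IoU), so Dice is never lower than IoU, which is consistent with the reported mDice exceeding the reported mIoU.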

Abstract:
Dense semantic segmentation is essential for autonomous driving, yet many multi‑modal datasets lack pixel‑level annotations. The Zenseact Open Dataset (ZOD) provides rich multi‑sensor data but only bounding‑box labels, limiting its use for segmentation research. Our primary contribution is a Segment Anything Model (SAM)‑based annotation pipeline that produces dense, pixel‑level annotations for ZOD by converting bounding boxes into semantic masks. In this pilot study, we process over 100,000 frames and manually curate a 2,300‑frame subset (36% acceptance rate) to establish a reliable baseline. Using these annotations, we evaluate transformer‑based CLFT and CNN‑based DeepLabV3+ architectures across diverse weather conditions, achieving up to 48.1% mIoU with CLFT‑Hybrid. To address extreme class imbalance, where pedestrians, cyclists, and signs constitute less than 1% of pixels, we explore specialized models targeting rare classes. We further validate the pipeline on the Iseauto autonomous‑vehicle platform, achieving 77.5% mIoU, and show that SAM‑derived representations transfer effectively across sensor configurations via bidirectional transfer learning. All code and annotations are released to support reproducible research.

Abstract:
This paper describes our approach for the 8th UG2+ Workshop (CVPR 2026) Track 2, which targets semantic segmentation of outdoor scenes degraded by five weather conditions: blur, darkness, snow, haze, and glare. A central challenge we observe is a severe generalization gap: models that perform well on the validation set often collapse on the test set. For instance, SegFormer‑B5 drops 16.1 mIoU points from validation to test, suggesting that model capacity alone is insufficient for robustness. We investigate whether a carefully designed training recipe, rather than architectural complexity, can address this gap. Starting from a pre‑trained SegMAN‑S backbone, we systematically study the effects of domain‑adaptive fine‑tuning, multi‑source data mixing, scene‑balanced sampling, and synthetic degradation augmentation. Our final system achieves 59.9% mIoU on the official test set while maintaining a validation‑test gap of only 6.5 points, less than half that of larger models. We analyze negative results from architectural modifications, loss function variants, and model scaling to provide practical insights for weather‑robust segmentation under limited data.

Abstract:
Terrain understanding is fundamental for mobile robots operating in unstructured outdoor environments. Existing vision‑based traversability estimation methods rely on robot‑specific annotations or semantic class mappings, limiting transferability across platforms and requiring costly re‑annotation when robot capabilities change, while standard semantic segmentation methods only focus on specific predefined classes, which do not capture the variety of terrains. In this work, we propose a transformer‑based architecture that jointly performs class‑specific semantic segmentation and class‑agnostic terrain segmentation within a unified network, called Trinity. Terrain regions are segmented based solely on visual appearance, without predefined semantic labels or robot‑dependent traversability scores. This formulation enables the learning of robot‑agnostic visual terrain priors that can be combined with robot‑specific experience for downstream tasks such as traversability estimation, visual odometry, and mission planning. To enable large‑scale training with diverse terrain appearances, we extend the OAISYS simulator and introduce RUGDSynth, a synthetic dataset inspired by RUGD with class‑agnostic terrain samples. Furthermore, we present the EXTerra Dataset, providing real‑world images annotated with both class‑specific and class‑agnostic terrain labels. Experiments demonstrate the feasibility of the proposed task and the effectiveness of our joint segmentation approach in complex outdoor environments. Code and datasets will be released with this publication (after review).

Abstract:
Real‑time semantic segmentation models offer an excellent balance between accuracy and inference speed. However, deploying these models in dynamic real‑world environments often requires the ability to learn novel classes incrementally without retraining on the entire dataset. This capability is known as continual learning. Standard fine‑tuning methods in deep learning often fail here due to catastrophic forgetting, where the model learns new information but forgets previously learned classes. This paper proposes a novel continual learning framework tailored for PIDNet, a widely cited state‑of‑the‑art real‑time semantic segmentation model. Our method, PILOT (Parallel Incremental Learning Over Time), introduces a real‑time and lightweight strategy by implementing a parallel Derivative‑branch (D‑branch) designed to capture the high‑frequency boundary information of novel classes while freezing the trained parameters of the original segmentation network. This setup allows the model to adapt to new semantic categories while preserving the knowledge of previously learned classes. By using only data associated with the new class, our model significantly reduces training overhead. Experimental results demonstrate that our approach successfully segments new classes while maintaining high mean Intersection over Union (mIoU) on the original base classes, thereby comfortably outperforming all major continual learning approaches in this domain. Overall, PILOT is shown to effectively mitigate catastrophic forgetting with minimal impact on inference latency, thus maintaining real‑time performance.

Abstract:
This dataset provides high‑resolution, annotated video sequences of shredded E40‑grade steel and copper scrap on a conveyor belt. Captured in a controlled laboratory environment, the data reflects the industrial post‑magnetic sorting stage, where manual intervention is typically required to remove copper contaminants. The dataset comprises 24,297 labeled frames across five subsets, featuring 396 steel and 101 copper objects categorized by size. It supports the development of machine learning models for material classification, object detection, and instance segmentation. Variations in object spacing and density are included to simulate realistic industrial sorting conditions. Ground truth annotations include pixel‑wise segmentation masks and material classes. This dataset serves as a benchmark for evaluating automated sorting algorithms aiming to identify copper impurities within complex, heterogeneous steel scrap streams.

Abstract:
Referring 3D Gaussian Splatting (R3DGS), which utilizes natural language for 3D object segmentation, has emerged as a crucial capability for embodied AI. However, existing methods typically rely on expensive per‑scene manual annotation and per‑view pseudo mask generation, which suffer from multi‑view inconsistency and poor generalization to varying query specificities. To address this, we present TrackRef3D, a fully automatic pipeline that achieves open‑world referring segmentation in 3D Gaussian Splatting (3DGS) without manual annotation by introducing a multi‑view consistent track‑then‑label paradigm that fundamentally decouples object discovery from semantic grounding. Specifically, we propose a Trajectory‑Aware Semantic Consensus Module (TSCM) which aggregates cross‑view predictions via synonymous clustering and trajectory‑aware voting to establish a canonical semantic identity, thereby ensuring multi‑view consistency. Furthermore, we employ a visibility‑aware description generation strategy to mitigate ambiguity and propose a Hybrid Training Strategy (HTS) that jointly optimizes coarse category semantics and fine‑grained referential cues to ensure robustness under varying query specificities using a multi‑positive contrastive objective. Extensive experiments on benchmarks demonstrate that TrackRef3D achieves state‑of‑the‑art performance.

Abstract:
We present a method for jointly predicting instance‑level roof segment masks together with three continuous geometric attributes (building height, roof slope, and roof azimuth) from a single aerial orthophoto. Our approach extends Mask R‑CNN with a dedicated attribute regression branch and introduces two key innovations: a conditional azimuth loss that suppresses supervision for flat roof segments where azimuth labels are inherently noisy, and a log‑normalized height representation that addresses the heavily skewed distribution of building heights. We train and evaluate on a large‑scale dataset of Dutch aerial images paired with automatically derived ground truth from 3DBAG, a nationwide LiDAR‑based 3D building dataset. Using a DINOv3 ConvNeXt‑Base backbone, our method achieves a mean absolute error of approximately 4 degrees for roof slope, 7 degrees for azimuth, and 1 meter for building height, with an instance segmentation AP50 of 0.566. The predicted per‑segment masks and attributes are sufficient to reconstruct simplified 3D building models (LoD2) from a single overhead image, requiring expensive 3D reference data only for training.
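
The two ideas named in this abstract, a log‑normalized height target and an azimuth loss that is switched off for flat segments, can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' formulation: the abstract gives no formulas, so the `log(1 + h)` encoding, the `1 - cos` circular error, and the 5‑degree flatness threshold `FLAT_SLOPE_DEG` are all hypothetical choices.

```python
import math
from typing import Sequence

FLAT_SLOPE_DEG = 5.0  # hypothetical threshold: segments below this slope count as flat


def encode_height(height_m: float) -> float:
    """Compress a non-negative height in metres with log(1 + h)."""
    return math.log1p(height_m)


def decode_height(encoded: float) -> float:
    """Invert encode_height."""
    return math.expm1(encoded)


def conditional_azimuth_loss(
    pred_deg: Sequence[float],
    target_deg: Sequence[float],
    slope_deg: Sequence[float],
) -> float:
    """Mean circular azimuth error, computed over non-flat segments only."""
    total, count = 0.0, 0
    for pred, target, slope in zip(pred_deg, target_deg, slope_deg):
        if slope < FLAT_SLOPE_DEG:
            continue  # flat roof: azimuth label is unreliable, so skip it
        diff = math.radians(pred - target)
        total += 1.0 - math.cos(diff)  # 0 when aligned, wraps correctly at 360
        count += 1
    return total / count if count else 0.0


# The second segment is flat, so its large azimuth error does not contribute.
loss = conditional_azimuth_loss([10.0, 200.0], [10.0, 20.0], [30.0, 1.0])
print(loss)
```

The logarithm compresses the long tail of tall buildings so that the regression target is less skewed, and the circular error avoids penalizing a prediction of 359 degrees against a label of 1 degree as if they were far apart.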

Abstract:
In this paper, we introduce InstructSAM, a unified and streamlined framework designed for multi‑instance segmentation under arbitrary instructions. We formulate instruction‑driven instance segmentation as a set‑structured query prediction problem and propose an explicit reasoning‑to‑instance query interface that bridges a vision‑language model (VLM) and SAM3. Specifically, a bank of learnable instance queries is injected into the VLM and contextualized with instruction and visual information, enabling each query to serve as an instance‑aware slot. A hybrid‑attention mechanism further promotes interaction among these queries, visual tokens, and instruction tokens, improving instance enumeration and reducing duplicate predictions. The resulting LLM‑conditioned queries are projected into SAM3's detector query space to drive accurate multi‑instance segmentation in a single forward pass. This design equips SAM3 with high‑level instruction understanding, compositional reasoning, and instance‑level set prediction without modifying its core architecture. To support training and evaluation, we further construct Inst2Seg, a high‑quality and large‑scale instruction‑based instance segmentation dataset and benchmark that couples free‑form instructions with instance‑level masks. Extensive experiments show that InstructSAM, at only 2B scale, achieves strong results across complex instruction‑driven and phrase‑level referring segmentation benchmarks, outperforming prior end‑to‑end methods and SAM3's agentic pipeline while enabling efficient single‑pass multi‑instance prediction.

Abstract:
Automated pavement distress assessment requires more than image‑level classification or coarse bounding box detection, demanding precise localization of thin, branching, and irregular cracks to achieve the geometric precision necessary for maintenance‑relevant quantification. This paper presents a vision‑based pavement distress analysis system based on Mask R‑CNN instance segmentation and evaluates it on UWGB‑StreetCrack, a custom field‑collected roadway image dataset acquired with a vehicle‑mounted smartphone and manually annotated with polygon labels for longitudinal cracks, transverse cracks, alligator cracks, and potholes. Five Detectron2‑based Mask R‑CNN backbone variants were considered under a consistent fine‑tuning protocol. The best‑performing model, Mask R‑CNN with a ResNet‑101 FPN backbone, achieved 84.23% precision, 90.04% recall, and an F1 score of 87.04% under the project‑specific bounding‑box matching protocol. The same model produced an aggregate predicted crack‑area fraction of 2.164%, closely matching the 2.170% ground‑truth crack‑area fraction. To contextualize the segmentation system against a detector‑oriented alternative, a CSPDarknet53‑based YOLO detector was also adapted and retrained on the dataset, reaching 27.5% precision and 20.7% recall on the validation protocol. The results show that instance segmentation is a practical direction for field pavement imagery and aggregate crack‑area estimation, while also exposing open challenges in annotation consistency, class imbalance, confounder rejection, and mask‑level benchmarking.
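
The reported F1 score follows directly from the reported precision and recall, since F1 is their harmonic mean. The short check below uses only the standard definition; the function name `f1_score` is chosen here for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported for the ResNet-101 FPN Mask R-CNN, in percent.
f1 = f1_score(84.23, 90.04)
print(round(f1, 2))  # 87.04
```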

Abstract:
Recent semantic segmentation research has increasingly moved toward stronger context modeling, dense attention, and transformer‑based architectures. Although these models achieve impressive performance, classical CNN‑based segmentation pipelines remain attractive because of their simplicity, efficiency, and ease of implementation. This paper revisits a practical question: how far can a ResNet‑based segmentation model be improved by only modifying the segmentation head? We propose ATV‑Net, an Adaptive Triple‑View Network that strengthens a ResNet‑101 backbone using three simple but complementary receptive‑field views. The micro view captures point‑wise semantic responses, the local view models neighborhood structures and object boundaries, and the scout view provides enlarged contextual cues. Instead of fusing these views with fixed weights, ATV‑Net introduces an Adaptive Decision Gate that dynamically selects receptive‑field responses according to input scene characteristics. A compact global coordination layer is further applied to improve spatial and semantic consistency. Experiments on the Cityscapes validation set show that ATV‑Net achieves 80.31% mIoU. This result suggests that classical CNN‑based segmentation is still far from obsolete: with simple receptive‑field views and adaptive fusion, a ResNet‑based pipeline can reach a competitive accuracy level without relying on transformer‑style global attention or overly complex context modules.

Abstract:
Dataset distillation (DD) aims to compress large‑scale datasets into compact synthetic sets while preserving training efficacy. However, existing studies mainly focus on image classification, leaving dense prediction tasks such as semantic segmentation largely underexplored. In this work, we identify three key challenges for segmentation DD: (i) long‑tailed class imbalance, (ii) the need for strict pixel‑wise alignment between images and dense labels, and (iii) the high computational cost of optimizing high‑resolution data with complex models. To address these challenges, we propose D3S2, a Diffusion‑guided Dataset Distillation framework for Semantic Segmentation. Our method adopts a two‑stage design. In Class‑Balanced Mask Selection, we construct a representative mask set via a greedy strategy that prioritizes underrepresented classes. In Diffusion‑Guided Image Synthesis, we employ a pretrained layout‑to‑image diffusion model to generate images conditioned on the selected masks, naturally ensuring spatial alignment. To further enhance the training utility of synthesized data, we introduce guided diffusion sampling with two complementary objectives: a segmentation‑consistency loss for pixel‑level alignment, and a class‑wise feature matching loss for aligning per‑class feature statistics across layers. Extensive experiments demonstrate the superiority of D3S2. Notably, at an extreme compression rate of 1%, our method achieves 24.99% and 35.49% mIoU on ADE20K and COCO‑Stuff with Mask2Former (Swin‑S), outperforming random selection by 9.34% and 5.70%, respectively.
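
A greedy selection that favours underrepresented classes, as described for the first stage, might look roughly like the sketch below. The abstract does not give the selection criterion, so the gain function (each class contributes 1 / (1 + times already selected)), the function name `select_masks`, and the representation of a mask as the set of class IDs it contains are all assumptions for illustration.

```python
from typing import Dict, List, Set


def select_masks(mask_classes: List[Set[int]], budget: int) -> List[int]:
    """Greedily pick up to `budget` mask indices, favouring rarely selected classes."""
    counts: Dict[int, int] = {}
    remaining = list(range(len(mask_classes)))
    chosen: List[int] = []

    def gain(idx: int) -> float:
        # A class contributes less each time it has already been selected.
        return sum(1.0 / (1 + counts.get(c, 0)) for c in mask_classes[idx])

    while remaining and len(chosen) < budget:
        best = max(remaining, key=gain)  # ties go to the lowest index
        chosen.append(best)
        remaining.remove(best)
        for c in mask_classes[best]:
            counts[c] = counts.get(c, 0) + 1
    return chosen


# Class 2 appears in only one mask; the greedy rule picks that mask second.
masks = [{0, 1}, {0, 1}, {0, 2}, {0}]
selected = select_masks(masks, budget=2)
print(selected)  # [0, 2]
```

After the first pick, classes 0 and 1 are already covered, so the mask containing the unseen class 2 scores higher than the duplicate of the first mask.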

Abstract:
Amodal instance segmentation aims to predict the complete object mask including occluded regions that lack pixel‑level observations and must be inferred with the aid of shape priors. Existing methods acquire shape priors through fixed‑capacity encoding spaces or expensive generative models, and inject them uniformly across all spatial positions without adapting to the varying prior demand between visible and occluded regions. In this paper, we propose a gated reliability‑adaptive shape prior framework, which introduces a shape prior memory module that combines learnable prototypes via cross‑attention to produce instance‑adaptive shape priors through weighted prototype combination rather than generation. A spatial adaptive reliability gate then employs the signed distance field of the visible mask to modulate injection intensity at each position according to its occlusion depth, preserving reliable features in visible regions while directing shape compensation toward occluded areas. Experiments on two mainstream amodal instance segmentation benchmarks demonstrate that the proposed method outperforms existing approaches under multiple evaluation settings, improving the mean intersection‑over‑union over occluded regions by over 11 percentage points on one of the two benchmarks under the standard setting, while using approximately one‑third of the total parameters. Linear probing analysis further reveals that the visible‑mask cross‑attention module implicitly encodes occlusion geometry into visual token representations, explaining the effectiveness of the proposed module decomposition.

Abstract:
Automated detection and masking of individual methane plumes from satellite imagery is important for operational emission attribution and quantification. We present a machine learning framework for plume detection from MethaneSAT retrieved column‑averaged dry‑air mole fractions of methane. We address two core challenges: the scarcity of labeled MethaneSAT data and the need for inference reliability across diverse atmospheric and surface conditions. We first demonstrate that Mask R‑CNN with a ResNet‑50 backbone outperforms U‑Net semantic segmentation on both MethaneAIR (an airborne version of MethaneSAT) and MethaneSAT data, with pixel‑level F1 score gains of 10.49 and 5.48 respectively. To address MethaneSAT data scarcity, we evaluate three cross‑sensor transfer strategies leveraging MethaneAIR flights and synthetic plumes. Mask R‑CNN with ResNet‑50 fine‑tuned from MethaneAIR pre‑trained weights is the most effective strategy, achieving instance‑level precision of 0.60 and a near‑perfect recall of 0.98 at the baseline operating point. A physics‑informed post‑processing pipeline converts detections into two operationally distinct modes. The first is a high‑sensitivity mode that applies morphological filtering and proximity‑based merging for comprehensive emission screening, achieving precision of 0.71 and recall of 0.94. The second is a high‑precision mode that additionally applies a distribution‑based classifier for confident source attribution, achieving precision of 0.92 and recall of 0.70. Manual review of detections classified as false positives against our wavelet‑based ground truth labels reveals that a meaningful fraction of cases correspond to real methane enhancements excluded by conservative labeling criteria, indicating that precision values reported are lower bounds on true detection performance... Our data and code are available at: https://doi.org/10.7910/DVN/FR959H

Abstract:
In 3D scene understanding, deep learning models rely on large models and extensive training to capture basic geometric structures that are present in the 3D data. However, existing methods lack explicit mechanisms to incorporate geometric information, such as learnable primitive shapes, often necessitating large models and more training data, which in turn increases cost and can limit generalization. We introduce GIBLy, a lightweight geometric inductive bias layer that integrates learnable geometric priors into 3D segmentation pipelines. GIBLy enhances existing architectures, whether MLP‑based, convolution‑based, or transformer‑based, by providing features aligned with simple geometric shapes (and thus human‑interpretable) that improve segmentation performance with minimal computational overhead. We validate our approach across multiple 3D semantic segmentation benchmarks, demonstrating consistent performance gains, including up to +11.5% mIoU on TS40K with PTV3, while adding only 58K extra parameters. Our results highlight the benefit of explicitly encoding geometric structure to support accurate and efficient 3D scene understanding, with a lightweight add‑on layer.

Abstract:
The demand for unmanned aerial vehicle (UAV)‑based image acquisition and analysis has surged, with UAVs increasingly utilized for semantic segmentation tasks. To meet the real‑time analysis requirements of UAV remote sensing missions, performing onboard computation and making decisions based on the results is a natural approach. However, deploying semantic segmentation on resource‑constrained UAV platforms presents two significant challenges: 1) hardware constraints limit the ability of UAVs to perform real‑time semantic segmentation, and 2) environmental variations during flight cause data distribution shifts, deviating from the original training data. To address these issues, this paper introduces SkySeg, a heterogeneous multi‑UAV air‑air cooperation framework that integrates computer vision and flight patterns to enable onboard semantic segmentation using low‑cost sensors. SkySeg employs an efficient information fusion inference method, combining low‑definition, wide‑area images with high‑definition, focused‑area images. Additionally, it incorporates a cross‑device test‑time adaptation (TTA) strategy to enhance segmentation performance in dynamic environments by collaboratively addressing distribution shifts of test data streams across UAVs. Experimental results demonstrate that our SkySeg framework speeds up inference by approximately 3.6x, improves onboard segmentation accuracy by 5.91%, and achieves a 10.91% average accuracy gain in the wild.

Abstract:
Vision Transformers (ViTs) can learn strong image‑level representations while their patch representations become less effective for dense prediction during prolonged training. We revisit this dense degradation phenomenon and argue that it is not fully explained by high‑norm artifacts alone. Instead, we characterize semantic diffusion: an optimization shortcut in which global semantic information spreads through patch tokens beyond what is locally justified. Our analysis shows that dense representation quality is not captured by locality alone: shallow features can remain better aligned with foreground regions yet underperform deeper features, and [CLS] features remain complementary for dense prediction. These observations suggest that the goal should not be to remove global context, but to make token interactions more selective. We therefore study sparse attention as a minimal intervention, replacing softmax attention with entmax‑1.5 while preserving global token connectivity. On DINOv1 ViT‑S/16 trained for 200 epochs on ImageNet‑1K, this change preserves ImageNet linear probing accuracy and substantially improves semantic segmentation performance: VOC mIoU increases from 42.80 to 48.78, ADE20K from 19.85 to 21.97, and Cityscapes from 36.79 to 37.87. These results suggest that selective token mixing is a simple and effective bias for improving dense ViT representations.
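
To make the difference from softmax concrete, the sketch below computes 1.5‑entmax for a single row of attention scores using the standard sort‑based threshold search. It is a plain‑Python illustration of the general entmax‑1.5 mapping, not the paper's training code; in an attention layer the same mapping would be applied to each row of the query‑key score matrix. The function name `entmax15` is chosen here.

```python
import math
from typing import List


def entmax15(scores: List[float]) -> List[float]:
    """Map raw scores to a sparse probability vector with 1.5-entmax."""
    top = max(scores)
    z = [(s - top) / 2.0 for s in scores]  # shift for stability, then halve
    z_sorted = sorted(z, reverse=True)

    # Candidate threshold tau_k, assuming exactly the k largest scores are active.
    taus = []
    running_sum = running_sq = 0.0
    for k, value in enumerate(z_sorted, start=1):
        running_sum += value
        running_sq += value * value
        mean = running_sum / k
        spread = running_sq - k * mean * mean
        delta = max((1.0 - spread) / k, 0.0)
        taus.append(mean - math.sqrt(delta))

    # Support size: how many sorted scores clear their own candidate threshold.
    support = sum(1 for tau, value in zip(taus, z_sorted) if tau <= value)
    tau_star = taus[support - 1]
    return [max(v - tau_star, 0.0) ** 2 for v in z]


weights = entmax15([10.0, 0.0])
print(weights)  # [1.0, 0.0]
```

The low‑scoring token receives a weight of exactly zero, whereas softmax would always keep it strictly positive. Tokens can still attend anywhere in the image, which matches the abstract's point that global connectivity is preserved while mixing becomes more selective.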

Abstract:
This study presents a closed‑loop robotic strawberry harvesting system that combines a robust vision module, simulation‑trained deep reinforcement learning (DRL) control, and ROS‑based real‑robot execution. For perception, we propose HRAttnEdge‑YOLO26‑seg, a modified YOLO26‑seg architecture that incorporates a high‑resolution P2 branch, segmentation‑path attention, and edge‑supervised prototype learning to improve instance segmentation in cluttered scenes. For control, we train a target‑conditioned Proximal Policy Optimization (PPO) policy in Isaac Lab to produce smooth joint‑position commands for a UR10e manipulator and deploy it on a UR10e robot for target‑fruit reaching and harvesting. This simulation‑based approach reduces hardware dependency, lowers development cost, and allows scalable policy training without exhaustive physical trials before real deployment. The proposed vision model demonstrated the highest overall performance among the evaluated methods. On both self‑collected and public datasets, the model showed a 10 to 14% improvement in segmentation performance. In controlled in‑house tests, the PPO controller produced stable and dynamically smoother motion than an inverse kinematics (IK)‑based MoveIt baseline. In greenhouse trials, the proposed integrated system harvested 281 strawberries, achieving 96.6% reaching success, 91.3% grasp‑and‑pull success, and 84.3% overall harvesting success. These results illustrate that task‑specific perception combined with simulation‑trained PPO can serve as a practical and resource‑efficient alternative to conventional planner‑dependent reaching in manipulation, enabling reliable closed‑loop robotic harvesting in complex agricultural environments.

Abstract:
Training high‑capacity vision models from scratch requires substantial computational resources. To improve training efficiency of a wide target model, existing growth methods often assume the availability of narrower models, obscuring the true computational cost of the entire pipeline. We propose an efficient training protocol, RBDC, that builds wide models by recursively coupling narrower, independently trained models in a parameter‑free, block‑diagonal way. This allows a flexible allocation of the available training budget across all the models involved. Evaluated with vision transformers (DeiT) and convolutional networks (ResNet) on ImageNet, our RBDC training protocol is substantially more efficient than training from scratch with the standard protocol, yielding a 30% FLOPs reduction at similar test accuracy. It also achieves higher performance at the same training FLOPs than training protocols from the model growth literature. Finally, we show that our models can serve as better backbones than their original counterparts for downstream object detection and instance segmentation tasks.
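
The sense in which a block‑diagonal coupling is parameter‑free can be shown on a single linear layer: placing two narrow weight matrices on the diagonal of a wider one adds only zeros, and the wide layer initially computes exactly what the two narrow layers compute side by side. The sketch below illustrates that property only. How RBDC handles other layer types, the recursion, and subsequent training is not described in the abstract, and the helper names `block_diag` and `matvec` are chosen here.

```python
from typing import List

Matrix = List[List[float]]


def block_diag(a: Matrix, b: Matrix) -> Matrix:
    """Place a and b on the diagonal of a larger matrix, with zeros elsewhere."""
    a_cols, b_cols = len(a[0]), len(b[0])
    top = [row + [0.0] * b_cols for row in a]
    bottom = [[0.0] * a_cols + row for row in b]
    return top + bottom


def matvec(w: Matrix, x: List[float]) -> List[float]:
    """Multiply matrix w by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


w_a = [[1.0, 2.0], [0.0, 1.0]]  # layer from narrow model A (2 inputs, 2 outputs)
w_b = [[3.0]]                   # layer from narrow model B (1 input, 1 output)
x_a, x_b = [1.0, 1.0], [2.0]

wide = block_diag(w_a, w_b)
wide_out = matvec(wide, x_a + x_b)
print(wide_out)  # [3.0, 1.0, 6.0]
```

The wide output is the concatenation of the two narrow outputs, so no knowledge from the independently trained models is lost at the moment of coupling.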

Abstract:
Vision foundation models pretrained on web‑scale data have recently shown strong transfer capabilities on many downstream tasks, but their effectiveness for industrial visual inspection remains unclear. Industrial data differ substantially from web data and often require fine‑grained dense prediction, raising the question of whether modern self‑supervised pretraining can improve over the conventional transfer‑learning paradigm based on supervised ImageNet initialization. In this work, we compare ConvNeXt backbones pretrained with supervised ImageNet classification or DINOv3 distillation, and relate them to the conventional ResNet‑50 baseline. We evaluate semantic segmentation, instance segmentation, and object detection across four downstream datasets spanning RGB surface‑defect inspection and X‑ray defect detection. We further study both frozen and fully finetuned adaptation regimes. Our results show that DINOv3 offers no clear advantage in frozen transfer, but provides a stronger initialization after full finetuning on RGB tasks, yielding faster convergence and better final performance. Under X‑ray modality shift, however, supervised ImageNet pretraining remains more effective in both frozen and finetuned settings. Overall, our findings suggest that modern vision foundation models are promising for supervised RGB industrial inspection, but their transferability is strongly conditioned by downstream adaptation and target modality.

Abstract:
Understanding and monitoring human behavior in metro stations play an important role in supporting suicide prevention efforts, where early identification of high‑risk situations can enable timely intervention. This requires assessing suicide risk from a surveillance video by jointly reasoning about the behavior of each passenger, his/her spatial context, and temporal dynamics. However, this assessment using videos captured by surveillance cameras is challenging, as it demands accurate perception of human motion, understanding of platform geometry, and aggregation of heterogeneous behavioral cues over time. In this work, we formalize the task of Suicide Risk Assessment (SRA) in metro stations and introduce the first interpretable framework that addresses this challenge. Unlike approaches that focus on isolated subtasks or attempt to infer intent directly, our formulation assesses suicide risk from accumulated evidence by incorporating person tracking, activity recognition, semantic segmentation of the platform, and trajectory‑driven risk heatmap modeling. By formalizing SRA as a distinct task and benchmarking a complete operational pipeline achieving 83.2% ROC‑AUC on real surveillance data, this work highlights the complexity of suicide risk assessment and opens new directions for research on interpretable AI systems for social good.

Abstract:
Fine‑grained semantic segmentation requires both precise localization and discrimination between visually similar classes. In FungiTastic, this problem is further complicated by a long‑tailed distribution and strong variation in image acquisition conditions. We propose a training‑free two‑stage framework that decouples segmentation from classification. SAM3 first produces class‑agnostic mushroom masks using macro‑taxonomic prompts, and DINOv3 then assigns fine‑grained labels through prototype matching in the embedding space. To improve this stage, we apply a simple transformation of the DINOv3 feature space that improves prototype‑based classification. Compared with class‑specific prompting, our approach is more scalable and keeps the segmentation cost low. We report results from one‑shot to few‑hundred‑shot regimes, providing, to the best of our knowledge, the first baseline for fine‑grained semantic segmentation in low‑data settings.
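
Prototype matching in an embedding space can be sketched as follows. This illustration assumes each mask has already been reduced to a single feature vector, that prototypes are per-class means, and that cosine similarity is the matching score. The class names and vectors are made up, and the feature-space transformation mentioned in the abstract is not modeled.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Elementwise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def build_prototypes(support):
    """support maps a class label to the embeddings of its few labeled examples."""
    return {label: mean_vector(vecs) for label, vecs in support.items()}


def classify(embedding, prototypes):
    """Assign the label whose prototype is most similar to the embedding."""
    return max(prototypes, key=lambda label: cosine(embedding, prototypes[label]))


# Hypothetical two-shot / one-shot support set with 2-D embeddings.
support = {"class_a": [[1.0, 0.0], [0.8, 0.2]], "class_b": [[0.0, 1.0]]}
prototypes = build_prototypes(support)
predicted = classify([0.9, 0.1], prototypes)
```

Because the prototypes are computed once from the support set, adding a class only requires adding its embeddings, which is one reason this style of classifier scales to many fine-grained labels.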

Abstract:
We propose SADGE, a quantitative similarity metric that predicts the performance of synthetic image datasets for common computer vision tasks without downstream model training. Estimating whether a synthetic dataset will lead to a model that performs well on real‑world data remains a bottleneck in model development. Existing evaluation metrics (e.g., PSNR, FID, CLIP) primarily measure semantic alignment between real and synthetic images (Appearance Similarity Score). Less commonly, structural similarity between images is considered to assess the domain gap (Geometric Similarity Score). However, to the best of our knowledge, there exist no studies that evaluate which similarity metric is the best downstream predictor for a given synthetic dataset. In this paper, we show over a wide variety of synthetic datasets and downstream tasks that neither appearance nor geometry alone can reliably predict downstream performance; rather, it is their non‑linear interplay that dictates synthetic data utility. Specifically, we measure how commonly used Appearance and Geometric Similarity metrics computed between synthetic and real images correlate with downstream performance in object detection, semantic segmentation, and pose estimation. Across five public synthetic‑to‑real benchmark families and 15 dataset‑level variants (79k image pairs), SADGE achieves the strongest association with downstream transfer performance under both linear and rank‑based criteria, reaching Pearson r=0.88 and Spearman rho=0.77. We compute SADGE scores for each combination of geometry‑based and appearance‑based methods across all benchmark families. The best configuration is obtained by fusing DINOv3 appearance similarity with MASt3R geometric consistency through a constrained bilinear interaction, outperforming both the strongest geometry‑only baseline and the strongest appearance‑only baseline.
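
The linear and rank-based criteria named above are the standard Pearson and Spearman correlation coefficients. The sketch below shows how both can be computed for a list of metric scores against downstream performance. The example numbers are invented and are not results from the paper.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Invented example: a monotone but non-linear relationship.
metric_scores = [0.1, 0.2, 0.3, 0.4]
downstream = [1.0, 4.0, 9.0, 16.0]
r = pearson(metric_scores, downstream)
rho = spearman(metric_scores, downstream)
```

In this toy case the rank correlation is perfect while the linear correlation is slightly lower, which is why reporting both criteria is informative.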

Abstract:
Land Use Land Cover (LULC) classification is essential for national 3D mapping, geospatial analysis, and sustainable planning. Multispectral (MS) LiDAR provides synchronized spatial‑spectral information, and deep learning (DL) enables 3D point cloud semantic segmentation; however, adoption is limited by the lack of publicly available urban and suburban MS LiDAR datasets aligned with National Mapping and Cadastral Agencies (NMCAs) classification schemes. This study addresses these gaps by introducing L1 and L2 NMCA‑aligned LULC classification schemes and a new benchmark MS LiDAR dataset. We evaluate seven state‑of‑the‑art DL models and perform spectral ablation studies at both levels of detail. Results show that Point Transformer V3 achieves the best performance, with mIoU of 79.4% (L1, 8 classes) and 58.9% (L2, 20 classes) using a dual‑wavelength LiDAR system (532 nm and 1064 nm). Ablation results show that multispectral information improves performance over geometry‑only inputs, with gains of 1.1 percentage points at L1 and 7.8 points at L2. These results highlight the value of LiDAR reflectance for fine‑grained material discrimination and support the evolution of NMCA LULC schemes toward higher semantic detail. The Loosdorf‑MSL dataset contributes a new benchmark for consistent national and international LULC mapping.

Abstract:
Modern vision backbones treat pointwise activations (e.g., ReLU, GELU) and exponential softmax as essential sources of nonlinearity, but we demonstrate they are not required within MetaFormer‑style vision backbones. We design activation‑free polynomial alternatives for three core primitives (MLPs, convolutions, and attention), where Hadamard products replace standard nonlinearities to yield polynomial functions of the input. These modules integrate seamlessly into existing architectures: instantiated within MetaFormer, a modular framework for vision backbones, our PolyNeXt models match or exceed activation‑based counterparts across model scales on ImageNet classification, ADE20K semantic segmentation, and out‑of‑distribution robustness. We also substantially outperform prior polynomial networks at reduced computational cost, showing that polynomial variants of standard modules beat complex custom architectures.
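
To make the idea of replacing a pointwise activation with a Hadamard product concrete, here is a minimal sketch of a degree-2 polynomial block. The exact PolyNeXt module layout is not given in the abstract, so the two-branch form and the names `w_a`, `w_b`, and `w_out` are assumptions.

```python
def linear(weights, x):
    """Apply a bias-free linear layer given as a list of weight rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def hadamard_mlp(x, w_a, w_b, w_out):
    """Compute W_out((W_a x) * (W_b x)) with no ReLU or GELU anywhere."""
    a = linear(w_a, x)
    b = linear(w_b, x)
    gated = [p * q for p, q in zip(a, b)]  # elementwise (Hadamard) product
    return linear(w_out, gated)


# Hypothetical weights chosen so the arithmetic is easy to follow.
w_a = [[1, 0], [0, 1]]
w_b = [[1, 1], [1, -1]]
w_out = [[1, 1]]

y = hadamard_mlp([1, 2], w_a, w_b, w_out)
y_scaled = hadamard_mlp([2, 4], w_a, w_b, w_out)
```

The output is a polynomial of degree two in the input, so doubling the input quadruples the output here. Stacking such blocks raises the polynomial degree, which is where the nonlinearity comes from.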

Abstract:
Evolving data streams induce joint nonstationarity in continual semantic segmentation, where semantic classes, input distributions, and supervision availability change simultaneously over time. This setting reflects practical structured prediction systems, yet remains largely unexplored in prior continual learning work, which typically studies these factors in isolation. We formalize continual segmentation under coupled class, domain, and label shifts and investigate learning in heterogeneous dense prediction environments with limited annotations and abundant unlabeled data. To address instability and overfitting arising from few‑shot supervision under distribution drift, we introduce gradient‑adaptive stabilization, a parameter‑wise regularization mechanism implemented via gradient‑scaled stochastic perturbations that promotes a principled stability‑plasticity tradeoff. We further leverage unlabeled data through semi‑supervised learning and introduce prototype anchored supervision that validates pseudo‑labels via joint confidence and prototype consistency. Together, these mechanisms enable learning under joint nonstationarity in continual segmentation. Extensive empirical evaluation across class‑incremental, domain‑incremental, and few‑shot regimes demonstrates consistent improvements over prior methods in heterogeneous structured prediction settings. Our results expose fundamental failure modes of existing continual segmentation approaches and provide insight into learning robust dense predictors in dynamically evolving environments.

Abstract:
Foundation models like the Segment Anything Model (SAM) demonstrate impressive zero‑shot generalization but frequently degrade under diverse real‑world illumination, particularly for instance segmentation. In this work, we address this limitation by developing Lighting Convolutional‑Attention (LCA), an adapter module that enhances segmentation robustness without fine‑tuning the heavy backbone. LCA employs a dual‑branch architecture to process RGB features alongside contrast maps, enabling physically motivated sensitivity to structural changes rather than illumination artifacts. We optimize LCA through a pairwise training strategy, introducing a targeted loss term that explicitly penalizes discrepancies between clean images and their corresponding illumination variants. To evaluate and support this architecture, we conduct a comprehensive empirical study across multiple existing benchmarks and present a novel Unity‑based synthetic dataset specifically designed to accurately replicate complex real‑world lighting conditions. Extensive experimental results demonstrate that our approach successfully bridges the domain gap, delivering superior lighting‑robust segmentation.

Abstract:
The microscopic examination of white blood cells (WBCs) plays a fundamental role in pathology and is essential for diagnosing blood disorders such as leukemia and anemia. To support further research on WBC images, multiple datasets have been proposed. However, they mainly annotate cell categories and lack the detailed morphological characteristics that pathologists use to explain their interpretations of cells. To address this gap, we introduce WBCAtt+, a novel dataset of WBC images densely annotated with 11 morphological attributes and five pixel‑level cell components. With 113k image‑level labels and 10k segmentation maps, WBCAtt+ is the first to provide comprehensive annotations for WBC images. Leveraging this dataset, we provide baseline models for attribute recognition and semantic segmentation. We also design an attribute recognition model to incorporate the compositional structure of cells, further improving the recognition performance. Lastly, we showcase various applications enabled by our dataset, such as explainable AI models, including counterfactual example generation. The dataset and code are publicly available at https://doi.org/10.57967/hf/8143.

Abstract:
Vision foundation models (VFMs) have achieved strong performance across various vision tasks. However, it still remains challenging to apply VFMs for cross‑domain few‑shot segmentation (CD‑FSS), which segments objects of novel classes under domain shifts using only a few labeled exemplars. The challenge is mainly driven by two factors: (1) limited labeled exemplars per novel class relative to the scale of VFM pre‑training, making the model prone to overfitting during retraining, and (2) target‑domain shifts underrepresented during pre‑training, inducing cross‑domain inconsistency and layer‑wise sensitivity. To address these issues, we propose Hierarchical Exemplar Representation Adaptation (HERA), a three‑stage select‑regularize‑calibrate VFM‑based segmentation framework that learns effectively from limited labels and adapts to novel domains without source‑data retraining. We first design Hierarchical Layer Selection (HLS) to adaptively identify the most informative VFM layer using a data‑dependent Exemplar Transfer Risk (ETR) computed for each candidate layer. Then, Prior‑Guided Regularization (PGR) regularizes interactions on the selected representation, yielding well‑structured local signals for the subsequent stage. Furthermore, Pixelwise Adaptive Calibration (PAC) combines the selected representation with the refined interaction maps to calibrate pixel‑wise predictions, producing consistent masks. Together, these stages form a hierarchical select‑regularize‑calibrate pipeline that guides frozen VFM features in new domains while fine‑tuning less than 2.7% of parameters at test time. Extensive experiments show that HERA surpasses the state of the art by more than 4.1 mIoU across multiple CD‑FSS benchmarks.

Abstract:
Circular sample plots are a cornerstone of forest inventory, yet accurate measurement of tree diameter at breast height (DBH) and spatial location within such plots remains challenging. Conventional approaches rely either on costly terrestrial LiDAR systems or labor‑intensive manual methods involving calipers and compass bearings, limiting their scalability and accessibility in large‑scale environments. We present a lightweight, smartphone‑based pipeline that enables complete plot‑based tree measurement from a single walkthrough video, requiring no specialized hardware beyond a consumer smartphone mounted on a portable stand. The proposed method integrates pretrained monocular depth estimation and tree instance segmentation with a simultaneous localization and mapping (SLAM) framework to jointly refine camera trajectories and depth across the video sequence. Tree positions and DBH estimates are recovered by fusing SLAM‑derived camera poses with segmented depth maps, with absolute real‑world scale anchored via a calibrated reference length. The system was evaluated in both managed forest plots and a natural forest plot, achieving a mean absolute error of 1.51 cm (MARE 3.98%) and 2.30 cm (MARE 5.69%), respectively, with consistent performance across varying starting directions and positions. Cross‑video consistency analysis further demonstrated stable and reproducible tree localization across measurements initiated from different starting positions. The proposed approach achieves accuracy comparable to established field methods while substantially reducing equipment cost and operational complexity, making it accessible to both professional researchers and non‑expert forest managers in diverse operational settings.

Abstract:
We present CosFly, a box‑structured planning and multimodal simulation pipeline for aerial tracking, together with CosFly‑Track, a large‑scale UAV dataset for dynamic target tracking across diverse environments including urban centers, highways, rural landscapes, forests, and coastal towns. In our current implementation on CARLA, CosFly provides a modular 7‑step construction pipeline that converts complex 3D worlds into structured obstacle representations for planning, then projects the resulting trajectories back into multi‑modal sensor data ‑‑ including RGB images, high‑precision depth maps, and semantic segmentation masks ‑‑ paired with natural language navigation instructions. A key feature is the support for configurable fixed‑FOV zoom levels (one FOV setting drawn per trajectory and held constant throughout), enabling simulation of various focal lengths through camera‑intrinsic adjustments. The pipeline covers the complete workflow from 3D map export through grid simplification, pedestrian and drone trajectory planning, multi‑modal rendering with 6‑DOF pose annotations, quality inspection, and teacher‑student caption generation. We analyze two trajectory‑planning paradigms for aerial target tracking: a conventional two‑stage pipeline with front‑end candidate generation and backend refinement, and a direct gradient‑based formulation that optimizes multiple tracking constraints in a single objective. The public CosFly‑Track release contains 250 validated trajectories and approximately 100,000 rendered images with complete 6‑DOF drone pose annotations (position x, y, z and orientation yaw, pitch, roll). Together, the pipeline and dataset establish a scalable foundation for aerial‑ground collaborative research, supporting dynamic target tracking, UAV navigation, and multi‑modal perception across diverse environments.
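
The link between a fixed field of view and the simulated focal length follows from the pinhole camera model. The sketch below assumes a horizontal FOV in degrees, square pixels, and a principal point at the image center. The function names are illustrative and not part of the CosFly pipeline.

```python
import math


def focal_length_px(image_width_px, horizontal_fov_deg):
    """Pinhole relation: f = (W / 2) / tan(FOV / 2), in pixels."""
    half_fov = math.radians(horizontal_fov_deg) / 2.0
    return (image_width_px / 2.0) / math.tan(half_fov)


def intrinsics(image_width_px, image_height_px, horizontal_fov_deg):
    """3x3 intrinsic matrix under the assumptions stated above."""
    f = focal_length_px(image_width_px, horizontal_fov_deg)
    cx, cy = image_width_px / 2.0, image_height_px / 2.0
    return [[f, 0.0, cx], [0.0, f, cy], [0.0, 0.0, 1.0]]


# A narrower FOV corresponds to a longer focal length (a zoomed-in view).
wide = focal_length_px(1920, 90.0)
narrow = focal_length_px(1920, 30.0)
k_matrix = intrinsics(1920, 1080, 90.0)
```

Drawing one FOV value per trajectory and holding it constant, as the abstract describes, would then amount to using a single intrinsic matrix for every frame of that trajectory.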

Abstract:
Automated structural health monitoring is essential to prevent catastrophic infrastructure failures. Precise, pixel‑level defect segmentation is needed to accurately assess structural integrity, but progress in defect segmentation for civil infrastructures has been held back by an extreme scarcity of data, which requires costly expert annotation. The need for data is accentuated by algorithmic hurdles intrinsic to the problem, including center‑bias and the need to rely more on shape when inspecting nearly textureless building materials. To remove the bottleneck, we introduce Cracks in the Foundation (CiF), the largest and most detailed civil infrastructure (instance) segmentation dataset to date, comprising approximately 150,000 high‑resolution images meticulously curated over five years in collaboration with civil engineering experts. With the help of this unprecedented data source, we expose a blind spot of current visual AI: despite the advent of promptable Foundation Models (FMs) and Vision Language Models (VLMs), and despite the impressive abilities of today's specialised segmentation models, it turns out that dense image understanding in the built environment is nowhere near solved. Our evaluations indicate that even the most recent zero‑shot FMs face significant challenges when deployed on real‑world infrastructure and even the performance of specialised models with domain‑specific supervision plateaus at approximately 25% mAP. CiF establishes inspection of civil infrastructure, an elementary and seemingly easy perceptual task, as an open challenge that reveals fundamental weaknesses of present‑day models trained predominantly on internet images, literally and figuratively highlighting cracks in the current foundation model paradigm.

Abstract:
Learning dense correspondences across deformable 3D shapes remains a long‑standing challenge due to structural variability, non‑isometric deformation, and inconsistent topology. Existing methods typically trade off generalization, geometric fidelity, and efficiency. We address this by proposing SGSoft, a unified intrinsic pipeline that (i) constructs a geodesic correspondence field on a canonical template, (ii) learns multimodal dense descriptors guided by pretrained semantic priors with this geodesic correspondence field supervision, (iii) retrieves dense correspondences in a single feed‑forward pass via nearest‑neighbor search in descriptor space. This formulation enables stable and topology‑invariant supervision under large pose variation, structural differences, and remeshing. SGSoft achieves state‑of‑the‑art inter‑category generalization while offering the best accuracy‑efficiency trade‑off among prior methods. It also achieves near real‑time inference without pre‑alignment, pairwise optimization, or post‑refinement. Learned descriptors can be transferred effectively to downstream tasks such as semantic segmentation and deformation transfer, establishing a scalable and deployment‑ready paradigm for dense 3D correspondence.

Abstract:
Segment Anything Model 2 (SAM 2) serves as a core foundation model in the field of video segmentation. Building upon the original SAM model, it introduces a memory bank mechanism and demonstrates outstanding performance in tasks such as semi‑supervised video object segmentation and tracking anything. However, the computational complexity of SAM 2's multi‑stage image encoder and memory module has raised the barrier to the model's deployment in practical applications. To address this issue, we propose TinySAM 2, a lightweight video segmentation model that balances performance and efficiency. First, a memory quality management mechanism is introduced to select and retain highly informative historical frames as the memory. In addition, a joint spatial‑temporal token compression is proposed that reduces the memory storage and computational cost. Specifically, average pooling is first employed to compress redundant tokens in the spatial domain. In the temporal domain, informative tokens are selected across frames in the memory bank based on a token‑level similarity measurement. Furthermore, we adopt RepViT as the lightweight image encoder, which further reduces the model parameters. Extensive experiments on challenging datasets such as DAVIS and SA‑V demonstrate that TinySAM 2 achieves 90% of the performance of SAM 2.1, with only 7% of the memory tokens and 3% of the training data. This study effectively alleviates the bottlenecks in parameter count, computational load, and deployment costs associated with SAM 2, providing a resource‑efficient solution for the widespread application of video segmentation models on devices.
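
The spatial half of the token compression can be illustrated with plain average pooling over a grid of tokens. This is a sketch under assumptions: the pooling window size `k` and the grid layout are hypothetical, and the temporal similarity-based selection is not modeled because the abstract does not specify its criterion.

```python
def avg_pool_tokens(grid, k=2):
    """Average-pool an H x W grid of C-dim tokens with a k x k window and stride k.

    Rows and columns that do not fill a complete window are dropped.
    """
    h, w = len(grid), len(grid[0])
    pooled = []
    for i in range(0, h - h % k, k):
        row = []
        for j in range(0, w - w % k, k):
            window = [grid[i + di][j + dj] for di in range(k) for dj in range(k)]
            row.append([sum(ch) / len(window) for ch in zip(*window)])
        pooled.append(row)
    return pooled


# A 4 x 4 grid of 1-D tokens holding the values 0..15.
grid = [[[float(4 * r + c)] for c in range(4)] for r in range(4)]
pooled = avg_pool_tokens(grid, k=2)
tokens_before = sum(len(row) for row in grid)
tokens_after = sum(len(row) for row in pooled)
```

With a 2 x 2 window the token count drops by a factor of four per frame, before any temporal selection is applied.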

Abstract:
Counting objects in an image is a task applicable across many domains. For instance, crowd counting, inventory counting, and cell counting have been the focus of recent research. The major challenges in estimating the count of objects include overlapping objects, object scale issues, occlusions, and varying lighting conditions. In this report, we explore the problem of counting machine washer parts. Our technique is an extension of FamNet with an additional loss component, trained on the given dataset. We compare to three baseline methods: a traditional image processing pipeline, instance segmentation, and density map estimation. We evaluate the performance of these algorithms by computing the Mean Absolute Error (MAE) and the Root Mean Squared Error (RMSE) between the true object counts and the model outputs. Our approach achieves a performance of 1.96 MAE.
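
The two evaluation metrics named above have standard definitions, sketched here for lists of per-image counts. The example counts are invented and unrelated to the reported 1.96 MAE.

```python
import math


def mae(true_counts, predicted_counts):
    """Mean Absolute Error between true and predicted counts."""
    errors = [abs(t - p) for t, p in zip(true_counts, predicted_counts)]
    return sum(errors) / len(errors)


def rmse(true_counts, predicted_counts):
    """Root Mean Squared Error; penalizes large miscounts more than MAE does."""
    squared = [(t - p) ** 2 for t, p in zip(true_counts, predicted_counts)]
    return math.sqrt(sum(squared) / len(squared))


# Invented example: three images with true and predicted part counts.
true_counts = [10, 12, 8]
predicted_counts = [11, 10, 8]
example_mae = mae(true_counts, predicted_counts)
example_rmse = rmse(true_counts, predicted_counts)
```

RMSE is never smaller than MAE on the same data, so a large gap between the two indicates a few images with big counting errors.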

Abstract:
Recent aerial vision‑language navigation (VLN) datasets have grown rapidly, but they primarily address goal‑oriented navigation to static destinations, leaving UAV visual tracking ‑‑ continuously following a moving target while maintaining visibility ‑‑ largely without dedicated training data. We introduce CosFlyTrack, a large‑scale multi‑modal dataset and scalable generation pipeline for UAV visual tracking in urban environments. The dataset provides approximately 12,000 expert and perturbed UAV trajectories generated from 6,000 pedestrian paths, comprising 2.4 million timesteps (approximately 334 hours) with seven aligned data channels: RGB, metric depth, semantic segmentation, six‑degree‑of‑freedom drone pose, target state with visibility flag, bilingual (Chinese‑English) instructions, and trajectory‑pair metadata. To generate high‑quality expert trajectories, we develop MuCO, a multi‑constraint optimizer that plans directly in continuous three‑dimensional space with BVH‑accelerated collision and visibility queries, jointly enforcing target visibility, viewpoint quality, collision avoidance, smoothness, and kinematic feasibility, avoiding the discretization artifacts and post‑hoc smoothing of grid‑based planners. Fine‑tuning experiments on seven vision‑language models show that CosFlyTrack improves tracking performance to 78.3 to 95.6 percent SR@1 meter, a 53 to 69 percentage point gain over zero‑shot baselines, supporting the dataset as a training resource for dynamic target‑following agents. The dataset is publicly available at https://huggingface.co/datasets/AutelRobotics/CosFly; evaluation scripts and pre‑trained checkpoints are hosted at https://huggingface.co/AutelRobotics/CosFly‑Track.

Abstract:
Physical adversarial attacks often overfit single surrogate models and optimization objectives. While ensemble attacks can mitigate this, existing methods struggle with severe gradient conflicts within restricted physical texture spaces, significantly degrading cross‑model transferability. To bridge this gap, this paper proposes a Joint Multi‑Objective and Multi‑Model Optimization Framework (JMOF) that leverages quantitative similarity analysis to select the optimal surrogate model ensemble. Within JMOF, a dual‑level mechanism jointly suppresses prediction outputs and flattens intermediate feature distributions, balancing attack efficiency with deep generalization. Additionally, an Orthogonal Gradient Alignment (OGA) strategy resolves cross‑model gradient conflicts, transforming mutually repulsive gradients into synergistic optimization directions. Extensive simulated and real‑world experiments demonstrate that JMOF outperforms state‑of‑the‑art baselines against diverse black‑box detectors. Crucially, JMOF exhibits substantial cross‑vision‑task generalization, generating attacks capable of simultaneously deceiving object detection and semantic segmentation or monocular depth estimation models. This research advances the generalization limits of physical adversarial attacks, providing a robust framework for evaluating visual AI vulnerabilities in real‑world deployments.
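
The abstract does not define the OGA update rule. As an illustration of how conflicting gradients from two surrogate models can be reconciled, the sketch below uses a projection in the style of gradient-surgery methods such as PCGrad: when two gradients have a negative inner product, the conflicting component is removed. This is a stand-in technique and not necessarily what JMOF does.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project_out_conflict(g, other):
    """If g conflicts with other (negative dot product), remove g's component along other."""
    d = dot(g, other)
    if d >= 0:
        return list(g)
    scale = d / dot(other, other)
    return [a - scale * b for a, b in zip(g, other)]


def combine(grads):
    """Project each gradient against the original others in turn, then sum the results."""
    adjusted = []
    for i, g in enumerate(grads):
        g_adj = list(g)
        for j, other in enumerate(grads):
            if i != j:
                g_adj = project_out_conflict(g_adj, other)
        adjusted.append(g_adj)
    return [sum(col) for col in zip(*adjusted)]


# Hypothetical gradients from two surrogate models that point in conflicting directions.
g1 = [1.0, 0.0]
g2 = [-1.0, 1.0]
g1_fixed = project_out_conflict(g1, g2)
update = combine([g1, g2])
```

After projection, `g1_fixed` is orthogonal to `g2`, so following it no longer works against the second model's objective.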

Abstract:
Continual test‑time adaptation adapts a source‑pretrained model to non‑stationary, unlabeled target streams while retaining past competence, yet texture‑biased backbones risk error accumulation and catastrophic forgetting. Drawing inspiration from the process of decoupling shape and texture in the human visual system, we introduce MoASE, a plug‑in mixture‑of‑experts that disentangles domain‑agnostic structure from domain‑specific texture using Activation Sparsity Experts with Spatial Differentiable Dropout, forming complementary high‑ and low‑activation pathways, while high‑ and low‑rank bottlenecks diversify representations. The Activation Sparsity Gate produces input‑adaptive SDD thresholds for precise token selection, and the Domain‑Aware Router assigns per‑sample expert weights using texture‑sensitive cues. To curb confirmation bias on unlabeled streams and stabilize supervision, we then introduce Domain‑Adaptive On‑Policy Distillation to constitute MoASE++, with an EMA‑anchored on‑policy reverse KL distillation and an augmentation policy conditioned on entropy and confidence that aligns predictions across the same views and improves the robustness‑plasticity balance. Extensive experiments on classification (CIFAR‑10/100‑C, ImageNet‑C) and semantic segmentation (Cityscapes‑>ACDC) demonstrate consistent state‑of‑the‑art performance, offering a principled, controllable approach to continual adaptation in dynamic visual environments.

Abstract:
Intracranial aneurysms (IAs), characterized by unpredictable growth and risk of rupture, are a major cause of stroke and can lead to life‑threatening hemorrhages with high mortality and long‑term disability. With aging populations, the incidence and overall burden of cerebrovascular diseases are expected to increase, highlighting the need for scalable approaches to analyze complex medical data and improve population‑level understanding of these conditions. While digital twins and deep learning offer promising avenues for improving diagnosis, prognosis, and treatment, their effectiveness is limited by the scarcity of large‑scale, high‑quality medical data and corresponding labels. We present Synthetic VAsculature (SynVA), a modular toolkit for vascular mesh generation and anatomically consistent aneurysm synthesis. SynVA combines novel flow‑matching‑based methods for generating healthy vessel meshes with learning‑based approaches for anatomy‑conditioned aneurysm mesh generation ‑ aneurysms are computed from pre‑existing vascular geometries rather than being generated in isolation. In addition, we introduce the SynVA procedural model for vascular and aneurysm synthesis based solely on physiological principles and statistical priors, which enables the generation of large‑scale datasets (e.g., for the training of mesh‑based generative models). To this end, we release a dataset of 50,000 fully labeled mesh samples for a variety of downstream vision tasks, such as semantic segmentation. Extensive quantitative and qualitative evaluations demonstrate that SynVA generates realistic vessel geometries and anatomically plausible aneurysms. Specifically, our experiments indicate that some methods produce aneurysm shapes more aligned with expert human perception while others perform better on quantitative similarity metrics with reconstructions of real aneurysms.

Abstract:
Unsupervised pixel‑level video understanding remains challenging in real‑world scenarios, where motion blur, occlusion, and fast object dynamics often cause temporal drift and flickering pseudo‑labels. We propose VVitCutLER, an unsupervised framework for video object detection and instance segmentation, which improves the quality of pseudo‑labels through temporal consistency. Our core contribution is VitCut, a temporally stable pseudo‑label generator that reduces error accumulation during field degradation through cross‑frame region consistency. VitCut also uses a distillation decoder to achieve effective instance mask prediction. Building on VitCut, VVitCutLER further integrates cross‑frame feature aggregation to enhance video‑level robustness. Extensive experiments on standard video benchmarks demonstrate that VVitCutLER significantly improves detection and segmentation performance while reducing temporal instability. These results highlight the importance of temporally consistent supervision for robust pixel‑level video understanding.

Abstract:
Post‑disaster damage assessment requires rapid and accurate semantic segmentation of 3D point clouds to identify critical infrastructure such as damaged buildings and roads. Early Point Transformers (e.g., PTv1, PTv2) relied on computationally expensive neighbor searching (k‑NN) and Farthest Point Sampling (FPS). To improve efficiency, recent architectures like Point Transformer V3 (PTv3) adopted static serialization methods, such as Hilbert curves or Z‑order, to organize unstructured points for window‑based attention. However, these fixed orderings are not optimal for capturing the complex geometry of disaster scenes. In this paper, we propose OPTNet (Ordering Point Transformer Network), which introduces a learnable Point Sorter module. OPTNet utilizes a self‑supervised ordering loss to dynamically predict an optimal permutation that maximizes the locality of the attention mechanism. We evaluate our method on the 3DAeroRelief dataset, significantly outperforming state‑of‑the‑art baselines.
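
For context on the static serialization that the learnable sorter is meant to improve on, the sketch below orders 3D points along a Z-order (Morton) curve by interleaving the bits of their voxel indices. The `grid_size` and `bits` values are illustrative assumptions, and the code does not reproduce PTv3 or OPTNet.

```python
def interleave_bits_3d(ix, iy, iz, bits=10):
    """Morton code: bit b of ix, iy, iz goes to positions 3b, 3b+1, 3b+2."""
    code = 0
    for b in range(bits):
        code |= ((ix >> b) & 1) << (3 * b)
        code |= ((iy >> b) & 1) << (3 * b + 1)
        code |= ((iz >> b) & 1) << (3 * b + 2)
    return code


def morton_order(points, grid_size=0.5, bits=10):
    """Return point indices sorted by the Z-order code of their voxel coordinates.

    Voxel indices of 2**bits or more lose their high bits, so bits must cover the scene extent.
    """
    mins = [min(p[d] for p in points) for d in range(3)]

    def code(p):
        ix, iy, iz = (int((p[d] - mins[d]) / grid_size) for d in range(3))
        return interleave_bits_3d(ix, iy, iz, bits)

    return sorted(range(len(points)), key=lambda i: code(points[i]))


# Four hypothetical points; the first and third fall into the same voxel.
points = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (0.1, 0.0, 0.0), (1.0, 0.0, 0.0)]
order = morton_order(points, grid_size=1.0)
```

The ordering is fixed by geometry alone, which is the property the abstract contrasts with a learned, data-dependent permutation.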

Abstract:
Annotating large‑scale LiDAR point clouds for 3D semantic segmentation is costly and time‑consuming, which motivates the use of semi‑supervised learning (SemiSL). Standard LiDAR SemiSL methods typically adopt a two‑step training paradigm, where pseudo‑labels are separately generated from a single distillation source, either from the same or another LiDAR representation. Such supervision relies on a unique source of pseudo‑labels, which can reinforce confirmation bias and propagate errors during training, ultimately limiting performance. To address this challenge, we introduce CoLLiS, a novel framework that leverages Collaborative Learning for LiDAR Semi‑supervised segmentation. Unlike prior paradigms with decoupled pseudo‑labeling and training phases, CoLLiS trains multiple representations collaboratively in a single step by treating them as coequal students. Each student is adaptively distilled from multiple representations, while inter‑student disparities are monitored online to resolve contradictory supervision and effectively mitigate confirmation bias. Extensive experiments on three datasets demonstrate that CoLLiS consistently outperforms state‑of‑the‑art LiDAR SemiSL methods, with particularly strong gains in low‑label regimes.

Abstract:
Semantic segmentation is essential for analysing anatomical features in biomedical research, yet a performance gap remains for Vision Transformers (ViTs) in the field, particularly for sparse, fine‑structured, and low signal‑to‑noise targets. We attribute this challenge in part to the lightweight pixel decoders commonly used in promptable ViT models, which may lack the local inductive bias needed for high‑precision biomedical masks. We bridge this gap by introducing ViTC‑UNet, which conditions a UNet on frozen pre‑trained ViT representations through learnable tokens and a two‑way attention decoder. This combines ViT global visual priors with the local inductive bias and high‑resolution decoding capacity of UNets, while avoiding end‑to‑end ViT fine‑tuning even in cross‑domain settings. ViTC‑UNet outperforms baseline results in semantic segmentation tasks across MRI and CT modalities, demonstrating that structure‑conditioned UNet decoding can efficiently adapt large‑scale visual priors to high‑complexity biomedical segmentation.

Abstract:
Block attention, which processes the input as separate blocks that cannot attend to one another, offers significant potential to improve KV cache reuse in long‑context scenarios such as Retrieval‑Augmented Generation (RAG). However, its broader application is hindered by two key challenges: the difficulty of segmenting input text into meaningful, self‑contained blocks, and the inefficiency of existing block fine‑tuning methods that risk degrading performance. To address these, we first construct SemanticSeg, a large and diverse semantic segmentation dataset containing over 30k instances across 16 categories, including books, code, web text, and conversations, with text lengths ranging from 2k to 32k. Using this dataset, we train a lightweight segmenter to automatically partition text into human‑instinct‑aligned blocks with controllable granularity. Second, we propose block distillation, a training framework that is more efficient than block fine‑tuning and uses a frozen full‑attention teacher model to guide the block‑attention student. This framework integrates three novel components: block sink tokens to mitigate information loss at block boundaries, block dropout to leverage training signals from all blocks, and token‑level loss weighting to focus learning on block‑attention‑sensitive tokens. Experiments across multiple models and benchmarks demonstrate that our segmenter outperforms heuristic and statistical baselines, and block distillation achieves near‑full‑attention performance under block attention, establishing a practical and scalable pathway for deploying block attention.

Abstract:
Thermal infrared imaging is robust to illumination variations and smoke interference, making it important for all‑weather perception. However, the lack of natural color and fine texture limits target recognition, human visual interpretation, and the transfer of visible‑light models. Existing infrared colorization methods mainly rely on single‑band images, where insufficient spectral cues may lead to structural distortion and semantic confusion. Although infrared hyperspectral images provide rich spectral responses and material information, existing single‑band frameworks remain limited in modeling spatial‑spectral coupling and weak texture details. To address these issues, this paper presents FSCM, a spectral‑information‑guided GAN framework. Within FSCM, a frequency‑enhanced spatial‑spectral state‑space generator composed of cascaded FSB units is constructed. Each FSB integrates three complementary components: state‑space modeling captures global spatial‑spectral dependencies; the frequency enhancement module (FEM) combines multi‑level wavelet decomposition and Fourier gating to recover structural contours, directional high‑frequency details, and global frequency responses; and the dual‑stream hybrid gating module (DGM) integrates deformation‑aware sampling with sparse attention to enhance effective local structures and suppress background interference. Additionally, an online semantic segmentation‑guided loss is introduced to constrain the generated results, improving semantic consistency in complex road scenes. Experiments show that FSCM outperforms existing infrared colorization methods in visual quality and semantic fidelity.

Abstract:
Accurate electrocardiogram (ECG) delineation, the segmentation of meaningful waveform features, is crucial for cardiovascular diagnostics. However, the scarcity of annotated data poses a significant challenge for training deep learning models. Conventional semi‑supervised semantic segmentation (SemiSeg) methods primarily focus on consistency from unlabeled data, underutilizing the information exchange possible between labeled and unlabeled sets. To address this, we introduce CardioMix, a framework built on a bidirectional CutMix strategy guided by cardiac patterns for ECG segmentation. This approach enriches the labeled set with realistic variations from unlabeled data while simultaneously applying stronger supervisory signals to the unlabeled set, as the cardiac pattern‑guided mixing ensures all augmented samples remain physiologically meaningful. Our framework is designed as a plug‑and‑play module, demonstrating high compatibility with various SemiSeg algorithms. Extensive experiments on SemiSegECG, a public multi‑dataset benchmark for ECG delineation, demonstrate that CardioMix consistently outperforms existing CutMix‑based fusion strategies across diverse datasets and labeled ratios.

Abstract:
Data selection seeks to identify a compact yet informative subset from large‑scale training corpora, balancing sample quality against collection diversity. We formulate this problem as a Weighted Independent Set (WIS) on a similarity graph, where nodes represent data samples weighted by influence, and edges connect semantically redundant pairs. This formulation naturally yields subsets that are simultaneously high‑quality and diverse. However, two challenges arise in practice: naive node weights fail to distinguish informative signals from gradient noise, and edge construction under heterogeneous domain distributions produces structurally imbalanced graphs that bias selection toward sparse regions. To address these issues, we introduce two principled refinements from a unified graph perspective: (1) node value calibration that restricts influence estimation to the bilateral salient subspace to ground node importance in task‑relevant signals rather than surface‑level statistics; (2) local scale normalization that adapts edge thresholds to local neighborhood density, mitigating graph imbalance induced by cross‑domain distribution shifts. Together, these components yield a robust and scalable data selection pipeline dubbed SEED. We further construct Honeybee‑Remake‑SEED‑200K, a compact multimodal dataset curated by SEED. Extensive experiments show that SEED consistently outperforms state‑of‑the‑art methods on instruction tuning, visual instruction tuning, and semantic segmentation across diverse model families.
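
The Weighted Independent Set formulation above can be illustrated with a small sketch. It assumes cosine similarity with one global threshold for edges and a simple greedy heuristic for selection; neither is stated in the abstract, and SEED's node value calibration and local scale normalization are not modeled. All function names and the toy data are hypothetical.

```python
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)


def build_similarity_graph(embeddings, threshold):
    """Connect every pair whose cosine similarity reaches the threshold.

    Assumption: one global threshold; the abstract instead adapts thresholds
    to local neighbourhood density.
    """
    edges = {i: set() for i in range(len(embeddings))}
    for i, j in combinations(range(len(embeddings)), 2):
        if cosine(embeddings[i], embeddings[j]) >= threshold:
            edges[i].add(j)
            edges[j].add(i)
    return edges


def greedy_weighted_independent_set(weights, edges):
    """Visit nodes by descending weight, skipping neighbours of chosen nodes.

    This is a heuristic, not an exact WIS solver.
    """
    selected, blocked = [], set()
    for node in sorted(range(len(weights)), key=lambda n: -weights[n]):
        if node not in blocked:
            selected.append(node)
            blocked.update(edges[node])
    return sorted(selected)


# Toy data: samples 0 and 1 are near-duplicates, sample 2 is distinct.
embeddings = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
weights = [0.9, 0.8, 0.5]  # hypothetical influence scores
edges = build_similarity_graph(embeddings, threshold=0.95)
subset = greedy_weighted_independent_set(weights, edges)
print(subset)  # [0, 2]
```

Sample 1 is dropped even though its weight exceeds that of sample 2, because it is redundant with the higher-weighted sample 0. This is how an independent-set constraint trades quality against diversity.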

Abstract:
We present a highly detailed instance segmentation model for delineating individual tree crowns in natural broadleaf forests using aerial imagery acquired by unmanned aerial vehicles (UAVs). Tree crown delineation in broadleaf forests is more challenging than in other forest types due to the diversity of crown shapes and the lack of clearly defined treetops. To address this issue, we developed a deep‑learning‑based crown segmentation model trained on high‑quality annotated crown outlines. Skilled annotators manually delineated 18,507 crown polygons from orthomosaic images collected across seven forests in Japan, and we developed a model based on Mask2Former with multiple backbone architectures. The best model achieved high segmentation performance in structurally complex broadleaf forests using only RGB imagery. This performance was maintained when applied to geographically distinct forests within Japan, as well as to biologically distinct tropical rainforests in Borneo. These results demonstrate that training on a large number of high‑quality annotations is critical for achieving detailed and generalizable crown segmentation across diverse forest ecosystems. The developed model has been integrated into DF Scanner Pro, software that supports practical forest monitoring using UAVs, and this implementation is expected to enable a wide range of users to analyze tree‑level information in broadleaf forests from UAV imagery.

Abstract:
We introduce an empowered transposed Fully Connected Weighted (t‑FCW) graph representation to embed point clouds into a metric space. While the original t‑FCW has shown promising results for point cloud classification, the reasons behind its effectiveness and its broader applicability remained unclear. In this work, we analyze the properties that make the empowered and original t‑FCW effective and design a network that uses the empowered t‑FCW exclusively as its feature extractor. From an interpretability perspective, we build memory banks for classification, part segmentation, and semantic segmentation using the empowered t‑FCW. Our analysis reveals that the empowered t‑FCW inherits robustness from surface descriptors and provides interpretability through dimension‑wise relations. These properties enable a highly efficient and interpretable network, which processes the ModelNet40 classification problem in approximately 7 seconds on an NVIDIA RTX A5000 GPU. Importantly, the empowered t‑FCW can function both as a lightweight standalone baseline and as a complementary plug‑in to existing deep models.

Abstract:
Concealed Object Segmentation (COS) encompasses a family of dense‑prediction tasks, including camouflaged object detection, polyp segmentation, transparent object detection, and industrial defect inspection, where targets are visually entangled with their surroundings through different physical mechanisms. Existing methods either operate directly on RGB images or employ heterogeneous decompositions (e.g., Fourier, wavelet) that redistribute spatial evidence across scale/frequency coefficients, making pixel‑aligned cues less direct. We introduce a fundamentally different perspective: homogeneous image decomposition via Retinex theory, which factorizes an image into illumination and reflectance components within the same spatial domain. Our key insight is that visual entanglement enforces appearance matching in the composite space, but this does not necessitate simultaneous matching in both component spaces, a phenomenon we formalize as the Discriminability Gap Theorem. Crucially, we show that across diverse COS sub‑tasks, the underlying physical processes systematically anti‑correlate illumination and reflectance differences, yielding theoretical guarantees that Retinex decomposition preserves or strictly improves total foreground‑background discriminability across the full physical regime, with anti‑correlation maximizing the gain. Building on this, we propose RIDE comprising: (i) a Task‑Driven Retinex Decomposition module that learns segmentation‑optimal factorizations end‑to‑end; (ii) a Discriminability Gap Attention mechanism that adaptively exploits where decomposition helps; and (iii) a Camouflage‑Breaking Contrastive loss operating in reflectance feature space.

Abstract:
Illegal gold mining in the Amazon rainforest causes deforestation, water contamination, and long‑term ecosystem disruption, yet remains difficult to monitor at fine spatial scales. Satellite imagery supports large‑scale observation, but often misses small mining‑related structures and subtle land‑cover transitions, especially under frequent cloud cover. We introduce ELDOR, a large‑scale UAV benchmark for monitoring environmental and landscape disturbance from illegal gold mining in the rainforest. ELDOR contains manually annotated orthomosaic imagery covering over 2,500 hectares, with pixel‑level semantic labels for both mining‑related activities and surrounding ecological structures. With this unified annotation source, we establish four benchmark tasks: semantic segmentation, segmentation‑derived recognition, direct multi‑label classification, and class‑presence recognition with vision‑language models. Across these tasks, we compare generic and remote‑sensing‑specific segmentation models, vision foundation model‑related segmentation methods, direct multi‑label classification methods, and vision‑language models under a controlled closed‑set protocol. Results show that current methods still struggle with rare small‑scale mining structures and fine‑grained recovery classes, suggesting the need for context‑aware and multimodal modeling. To support domain analysis and practical use, we further build an interactive explorer for domain experts that provides a unified interface for data exploration and model inference.

Abstract:
Linear attention has emerged as a promising direction for scaling Vision Transformers beyond the quadratic cost of dense self‑attention. A prevalent strategy is to compress spatial tokens into a compact set of intermediate proxies that mediate global information exchange. However, existing methods typically derive these proxy tokens from predefined spatial layouts, causing token compression to remain anchored to image coordinates rather than the semantic organization of visual content. To overcome this limitation, we propose Representative Attention (RPAttention), a linear global attention mechanism that performs token compression directly in representation space. Instead of constructing intermediate tokens from fixed spatial partitions, it dynamically forms a compact set of learned representative tokens to enable semantically related regions to communicate regardless of their spatial distance, by following a lightweight Gather‑Interact‑Distribute paradigm. Spatial tokens are first softly gathered into representative tokens through competitive similarity‑based routing. The representatives then perform global interaction within a compact latent space, before broadcasting the refined information back to all spatial tokens via query‑driven cross‑attention. By replacing coordinate‑driven aggregation with representation‑driven compression, RPAttention preserves global receptive fields while adaptively aligning token communication with the content structure of each input. RPAttention reduces the dominant token interaction complexity from quadratic to linear scaling with respect to the number of spatial tokens, while maintaining expressive global context modeling. Extensive experiments across diverse vision transformer backbones on image classification, object detection, and semantic segmentation demonstrate the effectiveness of our design.
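
The Gather‑Interact‑Distribute paradigm described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the RPAttention implementation: learned projections, multiple heads, and scaling factors are omitted, keys and values are the same vectors, and the `seeds` argument is a hypothetical stand-in for whatever learned parameters drive the routing.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * vec[d] for w, vec in zip(weights, vectors)) for d in range(dim)]


def attend(queries, keys_values):
    """Softmax attention where keys and values are the same vectors."""
    return [
        weighted_sum(softmax([dot(q, k) for k in keys_values]), keys_values)
        for q in queries
    ]


def gather_interact_distribute(tokens, seeds):
    # Gather: each token splits its mass across the K seeds (softmax over
    # seeds, so seeds compete for tokens); each representative is then the
    # mass-weighted mean of all tokens. Cost O(N * K).
    routing = [softmax([dot(t, s) for s in seeds]) for t in tokens]
    reps = []
    for k in range(len(seeds)):
        mass = [r[k] for r in routing]
        total = sum(mass)
        reps.append(weighted_sum([m / total for m in mass], tokens))
    # Interact: full attention among the K representatives only. Cost O(K^2).
    reps = attend(reps, reps)
    # Distribute: every token queries the K representatives. Cost O(N * K).
    return attend(tokens, reps)


tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
seeds = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical routing parameters
outputs = gather_interact_distribute(tokens, seeds)
```

With K fixed, every step costs at most O(N·K) or O(K²), so the total grows linearly in the number of spatial tokens N. No step compares one spatial token directly with another.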

Abstract:
Video reasoning segmentation requires localizing objects across video frames from natural language expressions, often involving spatial reasoning and implicit references. Recent approaches leverage frozen large vision‑language models (LVLMs) by extracting attention maps and using them as spatial priors for segmentation, enabling training‑free grounding. However, these attention maps are optimized for text generation rather than spatial localization, often resulting in diffuse and ambiguous grounding signals. In this work, we introduce SteerSeg, a lightweight framework that identifies attention misalignment as the key bottleneck in attention‑based grounding and proposes to steer attention at its source through input‑level conditioning. SteerSeg combines learnable soft prompts with reasoning‑guided Chain‑of‑Thought (CoT) prompting. The soft prompts reshape the attention distribution to produce more spatially concentrated maps, while CoT‑derived attributes resolve ambiguity among similar objects by guiding attention toward the correct instance. The resulting attention maps are converted into point prompts across keyframes to guide a segmentation model, while candidate tracklets are ranked and selected using correlation‑based scoring. Our approach freezes the LVLM and segmentation model parameters and learns only a small set of soft prompts, preserving the model's pretrained reasoning capabilities while significantly improving grounding. Despite being trained only on Ref‑YouTube‑VOS, SteerSeg generalizes well across diverse benchmarks, significantly improving the spatial grounding capability of LVLMs. Project page: https://steerseg.github.io

Abstract:
RGB‑T semantic segmentation requires strictly aligned VIS‑IR‑Label triplets; however, such aligned triplet data are often scarce in real‑world scenarios. Existing generative augmentation methods usually adopt cascaded generation paradigms, decomposing joint triplet generation into local conditional processes. As a result, consistency among VIS, IR, and Label in spatial structure, semantic content, and cross‑modal details cannot be reliably maintained. To address this issue, we propose UniTriGen, a unified triplet generation framework that directly generates spatially aligned, semantically consistent, and modality‑complementary VIS‑IR‑Label triplets under the guidance of text prompts. UniTriGen first introduces a unified triplet generation mechanism, where VIS, IR, and Label are jointly encoded into a shared latent space and modeled with a diffusion process to enforce global cross‑modal consistency. Lightweight modality‑specific residual adapters are further integrated into this mechanism to accommodate modality‑specific imaging characteristics and output formats. To mitigate generation bias caused by imbalanced scene and class distributions in limited paired triplets, UniTriGen also employs a scene‑balanced and class‑aware few‑shot sampling strategy, which induces a more balanced sampling distribution and enhances the scene and class diversity of generated triplets. Experiments show that UniTriGen generates high‑quality aligned triplets from limited real paired data, thereby achieving consistent performance improvements across various RGB‑T semantic segmentation models.

Abstract:
Predictive maintenance in complex systems is often complicated by the heterogeneity and redundancy of monitored variables, which can obscure fault‑relevant information and reduce model interpretability. This work proposes a semantic feature segmentation framework that decomposes the monitored feature space into a canonical component, expected to retain the dominant predictive information, and a residual component containing structurally peripheral signals. The segmentation is defined through domain‑informed criteria and organizes monitoring variables into functional groups reflecting operational mechanisms such as throughput, latency, pressure, network activity, and structural state. To evaluate the effectiveness of this decomposition, we adopt a predictive perspective in which expected predictive risk is used as an operational proxy for task‑relevant information. Experimental results obtained through time‑aware cross‑validation show that the canonical space consistently achieves lower predictive risk than the residual space across multiple temporal configurations, indicating that the semantic segmentation concentrates the most relevant information for fault anticipation. In addition, the canonical segments exhibit significantly stronger intra‑segment coherence than inter‑segment dependence, and this structural organization remains stable after redundancy reduction. When compared with the full feature space and with a Principal Component Analysis (PCA) representation, the canonical space achieves comparable predictive performance while preserving the semantic meaning of the original variables. These findings suggest that semantic feature segmentation provides an interpretable and information‑preserving decomposition of monitoring signals, enabling competitive predictive performance without sacrificing the operational interpretability required in predictive maintenance applications.

Abstract:
Precise segmentation of brain structures in magnetic resonance imaging (MRI) is essential for reliable neuroimaging analysis, yet voxel‑wise deep models often yield anatomically inconsistent results that diverge from expert‑defined boundaries. In this research, we propose a landmark‑guided 3D brain segmentation approach that explicitly mimics the manual segmentation protocol of the Harvard‑Oxford Atlas. A Global‑to‑Local network automatically detects 16 landmarks representing key subcortical reference points. Then, a semantic segmentation model produces a coarse segmentation of 12 anatomical labels, each grouping multiple subcortical regions. Finally, a landmark‑driven post‑processing step separates these 12 labels into 26 distinct structures by enforcing local anatomical constraints. Experimental results demonstrate consistent improvements in boundary accuracy. Overall, integrating learned landmarks aligns segmentations more closely with manual protocols.

Abstract:
LiDAR scene generation is increasingly important for scalable simulation and synthetic data creation, especially under diverse sensing conditions that are costly to capture at scale. Typically, diffusion‑based LiDAR generators are developed under single‑domain settings, requiring separate models for different datasets or sensing conditions and hindering unified, controllable synthesis under heterogeneous distribution shifts. To this end, we present OmniLiDAR, a unified text‑conditioned diffusion framework that generates LiDAR scans in a shared range‑image representation across eight representative domains spanning three shift types: adverse weather, sensor‑configuration changes (e.g., reduced beams), and cross‑platform acquisition (vehicle, drone, and quadruped). To enable training a single model over heterogeneous domains without isolating optimization by domain, we introduce a Cross‑Domain Training Strategy (CDTS) that mixes domains within each mini‑batch and leverages conditioning to steer generation. We further propose Cross‑Domain Feature Modeling (CDFM), which captures directional dependencies along azimuth and elevation axes to reflect the anisotropic scanning structure of range images, and Domain‑Adaptive Feature Scaling (DAFS) as a lightweight modulation to account for structured domain‑dependent feature shifts during denoising. In the absence of a public consolidated benchmark, we construct an 8‑domain dataset by combining real‑world scans with physically based weather simulation and systematic beam reduction while following official splits. Extensive experiments demonstrate strong generation fidelity and consistent gains in downstream use cases, including generative data augmentation for LiDAR semantic segmentation and 3D object detection, as well as robustness evaluation under corruptions, with consistent benefits in limited‑label regimes.

Abstract:
Weakly supervised semantic segmentation (WSSS) trains dense pixel‑level segmentation models from partial or coarse annotations such as bounding boxes, scribbles, or image‑level tags. While recent work leverages foundation models such as the Segment Anything Model (SAM) to generate pseudo‑labels, these approaches typically depend on heuristic prompt choices and offer limited ways to incorporate prior knowledge or heterogeneous labels. We address this gap by taking a neurosymbolic perspective: integrating differentiable fuzzy logic with deep segmentation models. Weak annotations and domain‑specific priors are unified as continuous logical constraints that fine‑tune SAM under weak supervision. The refined foundation model then produces improved pseudo‑labels, from which we train a second‑stage prompt‑free segmentation model. Experiments on Pascal VOC 2012 and the REFUGE2 optic disc/cup segmentation dataset show that our logic‑guided fine‑tuning yields higher‑quality pseudo‑labels, leading to state‑of‑the‑art segmentation accuracy that often exceeds densely supervised baselines.
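
To illustrate how a weak annotation can become a continuous logical constraint, the sketch below turns a bounding box into a fuzzy-logic loss over a foreground probability mask. The operators are assumptions chosen for illustration (mean for the universal quantifier, probabilistic sum for the existential one, product for conjunction); the abstract does not say which fuzzy semantics the authors use. In practice the same expressions would be written on tensors in an autodiff framework so that gradients reach the segmentation model. Plain floats are used here to keep the example self-contained, and all function names are hypothetical.

```python
def fuzzy_not(a):
    """Standard fuzzy negation."""
    return 1.0 - a


def fuzzy_forall(truths):
    """Universal quantifier approximated by the mean truth value."""
    if not truths:
        return 1.0  # vacuously true
    return sum(truths) / len(truths)


def fuzzy_exists(truths):
    """Existential quantifier as the probabilistic sum 1 - prod(1 - t)."""
    miss = 1.0
    for t in truths:
        miss *= 1.0 - t
    return 1.0 - miss


def box_constraint_loss(prob_mask, box):
    """Return 1 - truth of: no foreground outside the box AND some inside.

    prob_mask is a 2D list of foreground probabilities; box is
    (top, left, bottom, right) with exclusive bottom/right bounds.
    """
    top, left, bottom, right = box
    inside, outside = [], []
    for r, row in enumerate(prob_mask):
        for c, p in enumerate(row):
            if top <= r < bottom and left <= c < right:
                inside.append(p)
            else:
                outside.append(p)
    outside_clear = fuzzy_forall([fuzzy_not(p) for p in outside])
    inside_hit = fuzzy_exists(inside)
    return 1.0 - outside_clear * inside_hit  # product t-norm for AND


box = (1, 1, 2, 2)  # a one-pixel box around the centre
consistent = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
violating = [[0.9, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.0]]
loss_consistent = box_constraint_loss(consistent, box)
loss_violating = box_constraint_loss(violating, box)
```

A mask that agrees with the box yields zero loss, while foreground mass outside the box or an empty box interior raises it. Other weak labels and domain priors could be expressed the same way and combined into a single objective.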

Abstract:
Object‑centric scene understanding is a fundamental challenge in computer vision. Existing approaches often rely on multi‑stage pipelines that first apply pre‑trained segmentors to extract individual objects, followed by per‑object 3D reconstruction. Such methods are computationally expensive, fragile to segmentation errors, and scale poorly with scene complexity. We introduce OCH3R, a unified framework for Object‑Centric Holistic 3D Reconstruction from a single RGB image. OCH3R performs one forward pass to simultaneously predict all object instances with their 6D poses and detailed 3D reconstructions. The key idea is a transformer architecture that predicts per‑pixel attributes, including CLIP‑based category embeddings, metric depth, normalized object coordinates (NOCS), and a fixed number of 3D Gaussians representing each object. To supervise these Gaussian reconstructions, we transform them into canonical space using the predicted 6D poses and align them with pre‑rendered canonical ground truth, avoiding costly per‑image Gaussian label generation. On standard indoor benchmarks, OCH3R achieves state‑of‑the‑art performance across monocular depth estimation, open‑vocabulary semantic segmentation, and RGB‑only category‑level 6D pose estimation, while producing high‑fidelity, editable per‑object reconstructions. Crucially, inference is fully feed‑forward and scales independently of the number of objects, offering orders‑of‑magnitude speedups over conventional multi‑stage pipelines in cluttered scenes.

Abstract:
Language‑guided segmentation transcends the scope limitations of traditional semantic segmentation, enabling models to segment arbitrary target regions based on natural language instructions. Existing approaches typically adopt a two‑stage framework: employing Multimodal Large Language Models (MLLMs) to interpret instructions and generate visual prompts, followed by foundational segmentation models (e.g., SAM) to produce masks. However, due to the limited spatial grounding capabilities of off‑the‑shelf MLLMs, these methods often rely on extensive training on large‑scale datasets to achieve satisfactory accuracy. While recent advances have introduced reasoning mechanisms to improve performance, they predominantly operate within the textual domain, performing chain‑of‑thought reasoning solely based on abstract text representations without direct visual feedback. In this paper, we propose Seg‑Agent, a completely training‑free framework that pioneers Explicit Multimodal Chain‑of‑Reasoning. Unlike prior text‑only reasoning, our approach constructs an interactive visual reasoning loop comprising three stages: generation, selection, and refinement. Specifically, we leverage Set‑of‑Mark (SoM) visual prompting to render candidate regions directly onto the image, allowing the MLLM to "see" and iteratively reason about spatial relationships in the visual domain rather than just the textual one. This explicit multimodal interaction enables Seg‑Agent to achieve performance comparable to state‑of‑the‑art training‑based methods without any parameter updates. Furthermore, to comprehensively evaluate generalization across diverse scenarios, we introduce Various‑LangSeg, a novel benchmark covering explicit semantic, generic object, and reasoning‑guided segmentation tasks. Extensive experiments demonstrate the effectiveness and robustness of our method.

Abstract:
Segmentation models in automated optical inspection of wire‑bonded semiconductors are typically device‑specific and must be re‑trained when new devices or distribution shifts appear. We introduce AOI‑SSL, a training‑efficient framework for semantic segmentation of wire‑bonded semiconductors that combines small‑domain self‑supervised pre‑training of vision transformers with in‑context inference that minimizes the need for labeled examples. We pre‑train SOTA self‑supervised algorithms on a small industrial inspection dataset and find that Masked Autoencoders are the most effective in this small‑data setting, improving downstream segmentation while reducing the labeled fine‑tuning effort. We further introduce in‑context, patch‑level retrieval methods that predict masks directly from dense encoder embeddings with negligible additional training. We show that, in this setting, simple similarity‑based retrieval performs on par with the more complex attention‑based aggregation currently used in the literature. Furthermore, our experiments demonstrate that self‑supervised pre‑training significantly improves segmentation quality compared to training from scratch and to ImageNet pre‑trained backbones under a fixed fine‑tuning computational budget. Finally, the results reveal that retrieval‑based segmentation outperforms fine‑tuning when targeting single device images, allowing for near‑instant adaptation to difficult samples.

Abstract:
Vision‑Language‑Action (VLA) policies are typically evaluated as if the user had finished typing or speaking before the robot begins acting. In real deployment, however, users take several seconds to enter a request, leaving the policy idle for a substantial fraction of the interaction. We introduce Premover, a lightweight module that converts this idle window into useful precomputation. Premover keeps the VLA backbone frozen and attaches two small projection heads, one for image patches, one for language tokens, that map an intermediate layer of the backbone into a shared space. The resulting focus map is supervised by simulator‑rendered target‑object segmentation masks and applied as a per‑patch reweighting of the next step's image tokens. A single scalar readiness threshold, trained jointly from streaming prefixes, decides when the policy should begin acting. On the LIBERO benchmark suite, Premover reduces mean wall‑clock time from 34.0 to 29.4 seconds, a 13.6% reduction, while matching the full‑prompt baseline's success rate (95.1% vs. 95.0%); naive premoving, by contrast, collapses to 66.4%.

Abstract:
Multimodal learning seeks to integrate information across diverse sensory sources, yet current approaches struggle to balance cross‑modal generalizability with modality‑specific structure. Continuous (implicit) methods preserve fine‑grained priors but render generalization challenging, while discrete (explicit) approaches enforce shared prototypes at the expense of modality specificity. We introduce CoDAAR (Cross‑modal Discrete Alignment And Reconstruction), a novel framework that resolves this long‑standing trade‑off by establishing semantic consensus across modality‑specific codebooks through index‑level alignment. This design uniquely allows CoDAAR to preserve modality‑unique structures while achieving generalizable cross‑modal representations within a unified discrete space. CoDAAR combines two complementary mechanisms: Discrete Temporal Alignment (DTA), which enables fine‑grained temporal quantization, and Cascading Semantic Alignment (CSA), which promotes progressive cross‑modal semantic agreement. Together, they establish a competition‑free unified representation space. Trained with self‑supervised reconstruction objectives on paired multimodal sequences, CoDAAR demonstrates robust cross‑modal and cross‑domain generalization. Across Cross‑Modal Generalization benchmarks, including event classification, localization, video segmentation, and cross‑dataset transfer, CoDAAR achieves state‑of‑the‑art performance, establishing a new paradigm for discrete and generalizable multimodal representation learning.

Abstract:
Many image understanding tasks involve identifying what is present and where it appears. However, tasks that address where, such as object discovery, detection, and segmentation, are often considerably more complex than image classification, which primarily focuses on what. One possible reason is that classification‑oriented backbones tend to emphasize semantic information about what, while implicitly entangling or suppressing information about where. In this work, we focus on an inductive bias termed what‑where separation, which encourages models to represent object appearance and spatial location in a decomposed manner. To incorporate this bias throughout an attentive backbone in the style of Vision Transformer (ViT), we propose the What‑Where Transformer (WWT). Our method introduces two key novel designs: (1) it treats tokens as representations of what and attention maps as representations of where, and processes them in concurrent feed‑forward modules via a multi‑stream, slot‑based architecture; (2) it reuses both the final‑layer tokens and attention maps for downstream tasks, and directly exposes them to gradients derived from task losses, thereby facilitating more effective and explicit learning of localization. We demonstrate that even under standard single‑label classification‑based supervision on ImageNet, WWT exhibits emergent multiple object discovery directly from raw attention maps, rather than via additional postprocessing such as token clustering. Furthermore, WWT achieves superior performance compared to ViT‑based methods on zero‑shot object discovery and weakly supervised semantic segmentation, and it is transferable to various localization setups with minimal modifications. Code will be published after acceptance.

Abstract:
State Space Models (SSMs) have emerged as a compelling alternative to attention models for long‑range vision tasks, offering input‑dependent recurrence with linear complexity. However, most efficient SSM variants reduce computational cost by modifying scan routes, resolutions, or traversal patterns, while largely leaving the recurrent dynamics implicit. Consequently, the model's state‑dependent memory behavior is difficult to control, particularly in compact backbones where long scan paths can exceed the effective memory horizon. We propose Token‑Conditioned Poles SSM (TCP‑SSM), a structured selective SSM framework that improves efficiency while making recurrence dynamics explicit and interpretable through stable poles. TCP‑SSM builds each scan operator with (1) real poles that model monotone or sign‑alternating decay, and (2) complex‑conjugate poles that capture damped oscillatory responses. Using bounded radius and angle modulation, TCP‑SSM converts shared base poles into token‑dependent poles, allowing each scan step to adapt its memory behavior to the current visual token while preserving pole stability. For practical scalability, we integrate grouped pole sharing with a lightweight low‑rank input pathway, yielding an efficient scan operator that preserves linear‑time scan complexity. Across image classification, semantic segmentation, and object detection, TCP‑SSM reduces SSM computational complexity by up to 44% in Vision Mamba‑style models while maintaining or surpassing baseline accuracy.
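
The two pole types named in the abstract above can be illustrated with the impulse response of a simple linear recurrence, where a pole p contributes p^k at step k. This is only a sketch of the underlying signal-processing fact, not the TCP‑SSM scan operator: the function names are hypothetical, and the token-dependent radius and angle modulation described in the abstract is left out.

```python
import cmath
import math


def real_pole_response(a, n):
    """Impulse response a**k of a single real pole.

    0 < a < 1 gives monotone decay; -1 < a < 0 gives sign-alternating decay.
    """
    return [a ** k for k in range(n)]


def conjugate_pair_response(r, theta, n):
    """Real impulse response of the conjugate pole pair r*exp(+/- i*theta).

    Averaging the two conjugate terms gives r**k * cos(k*theta),
    a damped oscillation whenever r < 1.
    """
    p = cmath.rect(r, theta)
    return [((p ** k + p.conjugate() ** k) / 2).real for k in range(n)]


def is_stable(pole):
    """A discrete-time pole is stable when it lies strictly inside the unit circle."""
    return abs(pole) < 1.0


if __name__ == "__main__":
    print(real_pole_response(0.5, 4))
    print(real_pole_response(-0.5, 4))
    print(conjugate_pair_response(0.9, math.pi / 2, 4))
```

Keeping the radius below 1 is what makes the response decay, which is why a bounded radius is enough to preserve stability in this simplified setting.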

Abstract:
Vision state space models inherit the efficiency and long‑range modeling ability of Mamba‑style selective scans. However, their performance depends critically on the representation of two‑dimensional visual features as one‑dimensional token sequences. Existing scan operators range from predefined geometric traversals to dynamic coordinate‑based samplers that reroute tokens through predicted offsets and interpolation. While effective, these mechanisms primarily adapt paths or sampling locations, rather than explicitly modeling which local patches should exchange information before global state‑space mixing. This motivates a simple question: can graphs help vision state space models see better? We introduce GraphScan, a graph‑induced dynamic scanning operator for Vision SSMs. For each token, GraphScan constructs a spatially bounded local graph, learns feature‑conditioned affinities with relative positional bias, and produces the output token by one‑step message passing over its semantic neighborhood. The resulting tokens are locally grounded before being processed by the selective SSM for global aggregation. GraphScan preserves token count and linear scaling in image size, while replacing coordinate‑conditioned interpolation with feature‑conditioned semantic routing. Integrated into a hierarchical backbone, GraphScan‑Mamba achieves state‑of‑the‑art performance among Vision SSMs across image classification, object detection, instance segmentation, and semantic segmentation, with modest computational overhead. Our analysis further shows that GraphScan induces interpretable displacement fields over the token lattice, providing a semantic and spatially grounded view of dynamic scanning. These results suggest that future Vision SSMs should treat scanning not merely as geometric serialization, but as learned local semantic routing before global state‑space modeling.

Abstract:
The multimodal fusion of images and scene captions has been extensively explored and applied in various fields. However, when dealing with complex remote sensing (RS) scenes, existing studies have predominantly concentrated on architectural optimizations for integrating textual semantic information with visual features, while largely neglecting the generation of high‑quality RS captions and the investigation of their effectiveness in multimodal semantic fusion. In this context, we propose the Dynamic MLLM Mixture‑of‑Experts Perception‑Guided Remote Sensing Scene Segmentation, referred to as MPerS. We design multiple prompts for MLLMs to generate high‑quality RS captions, enabling MLLMs to perceive RS scenes from diverse expert perspectives. DINOv3 is employed to extract dense visual representations of land‑covers. We design a Dynamic MixExperts module that adaptively integrates the most effective textual semantics. Linguistic Query Guided Attention is constructed to utilize textual semantic information to guide visual features for precise segmentation. The MLLMs include LLaVA, ChatGPT, and Qwen. Our method achieves superior performance on three public semantic segmentation RS datasets.

Abstract:
Existing self‑supervised learning (SSL) methods primarily learn object‑invariant representations but often neglect the spatial structure and relationships among object parts. To address this limitation, we introduce Spatial Prediction (SP), a spatially aware pretext regression task that predicts the relative position and scale between a pair of disentangled local views from the same image. By modeling part‑to‑part relationships in a continuous geometric space, SP encourages representations to capture fine‑grained spatial dependencies beyond invariant categorical semantics, thereby learning the compositional structure of visual scenes. SP is implemented as a decoupled plug‑in and can be seamlessly integrated into diverse SSL frameworks. Extensive experiments show consistent improvements across image recognition, fine‑grained classification, semantic segmentation, and depth estimation, as well as substantial gains in out‑of‑distribution robustness for object recognition. To evaluate spatial reasoning, we introduce (1) a position and scale prediction task on image patch pairs and (2) a jigsaw understanding task requiring patch reordering and recognition after reconstruction. Strong performance on these tasks indicates improved spatial structure and geometric awareness. Overall, explicitly modeling spatial information provides an effective inductive bias for SSL, leading to more structured representations and better generalization. Code and models will be released.

Abstract:
We present Urban‑ImageNet, a large‑scale multi‑modal dataset and evaluation benchmark for urban space perception from user‑generated social media imagery. The corpus contains over 2 million public social media images and paired textual posts collected from Weibo at 61 urban sites in 24 Chinese cities between 2019 and 2025, with controlled benchmark subsets at 1K, 10K, and 100K scale and a full 2M corpus for large‑scale training and evaluation. Urban‑ImageNet is organized by HUSIC, a Hierarchical Urban Space Image Classification framework that defines a 10‑class taxonomy grounded in urban theory. The taxonomy is designed to distinguish activated and non‑activated public spaces, exterior and interior urban environments, accommodation spaces, consumption content, portraits, and non‑spatial social‑media content. Rather than treating urban imagery as generic scene data, Urban‑ImageNet evaluates whether machine perception models can capture spatial, social, and functional distinctions that are central to urban studies. The benchmark supports three tasks within one standardized library: (T1) urban scene semantic classification, (T2) cross‑modal image‑text retrieval, and (T3) instance segmentation. Our experiments evaluate representative vision, vision‑language, and segmentation models, revealing strong performance on supervised scene classification but more challenging behavior in cross‑modal retrieval and instance‑level urban object segmentation. A multi‑scale study further examines how model performance changes as balanced training data increases from 1K to 10K to 100K images. Urban‑ImageNet provides a unified, theory‑grounded, multi‑city benchmark for evaluating how AI systems perceive and interpret contemporary urban spaces across modalities, scales, and task formulations. Dataset and benchmark are available at: huggingface.co/datasets/Yiwei‑Ou/Urban‑ImageNet and github.com/yiasun/dataset‑2.

Abstract:
Rapid and accurate damage assessment following natural disasters is critical for effective emergency response. However, identifying fine‑grained damage levels (e.g., distinguishing minor from major roof damage) in UAV imagery remains challenging due to the degradation of texture cues during resizing and extreme class imbalance. We propose DA‑SegFormer, a damage‑aware adaptation of the SegFormer architecture optimized for high‑resolution disaster imagery. Our method introduces a Class‑Aware Sampling strategy to guarantee exposure to rare damage features, and it integrates Online Hard Example Mining (OHEM) with Dice Loss to dynamically focus on underrepresented classes. In addition, we employ a resolution‑preserving inference protocol that maintains native texture details. Evaluated on the RescueNet dataset, DA‑SegFormer achieves 74.61% mIoU, outperforming the baseline by 2.55%. Notably, our improvements yield double‑digit gains in critical damage classes: Minor Damage (+11.7%) and Major Damage (+21.3%).
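
For readers unfamiliar with the two loss components named in the abstract above, here is a minimal binary-case sketch of Dice loss and Online Hard Example Mining. It does not reproduce DA‑SegFormer's implementation: the function names, the `keep_ratio` parameter, and the weighted sum in `combined_loss` are assumptions made for illustration, since the abstract does not say how the two terms are combined.

```python
import math


def dice_loss(probs, targets, eps=1e-6):
    """Soft Dice loss for one class: 1 - 2|P*T| / (|P| + |T|)."""
    inter = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)


def ohem_cross_entropy(probs, targets, keep_ratio=0.5):
    """Mean binary cross-entropy over only the hardest pixels.

    Per-pixel losses are sorted and the top `keep_ratio` fraction is kept,
    so easy pixels do not dominate the average.
    """
    losses = [
        -math.log(max(p if t == 1 else 1.0 - p, 1e-12))
        for p, t in zip(probs, targets)
    ]
    k = max(1, int(len(losses) * keep_ratio))
    hardest = sorted(losses, reverse=True)[:k]
    return sum(hardest) / k


def combined_loss(probs, targets, dice_weight=1.0, keep_ratio=0.5):
    """Assumed combination: OHEM cross-entropy plus weighted Dice."""
    return ohem_cross_entropy(probs, targets, keep_ratio) + dice_weight * dice_loss(
        probs, targets
    )
```

Dice is normalized by the size of the predicted and target regions, which is why it is commonly paired with a pixel-wise loss when some classes cover very few pixels.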

Abstract:
Interactive segmentation allows efficient label generation by leveraging user‑provided clicks to progressively refine predictions, which is critical when fully supervised labels are costly or generalization to unseen classes is needed. Existing 3D interactive methods are limited: most operate sequentially, predicting only one object per iteration with binary masks, while several recent approaches depend on 2D foundation models and camera alignment to bridge the 2D‑3D gap. To address these limitations, we propose a novel interactive segmentation framework that operates directly on sparse, randomly downsampled 3D points and processes multiple object clicks in a single forward pass. Our framework consists of a point Transformer‑based encoder and a hierarchical mask decoder, which integrates multi‑level crop‑and‑merge operations conditioned on learnable semantic embeddings. Unlike prior interactive approaches that require repeated model updates after each manually corrective click, our method jointly reasons over all click queries, modeling inter‑instance relationships and refining both spatial masks and semantic predictions through spatial and semantic embeddings. Extensive experiments demonstrate that our model improves the mIoU metric by over 20 percent compared to strong baselines and achieves 8‑10 percent gains under cross‑dataset evaluation for a one‑click per instance setting, often requiring only a single click per object. Our approach provides a generalizable and efficient solution for interactive 3D instance segmentation, particularly suitable for real‑time applications such as robotic manipulation, navigation, and rapid 3D semantic annotation.

Abstract:
Post‑disaster situational awareness relies heavily on understanding both the extent and the volume of floodwaters. While 2D semantic segmentation provides accurate flood masking, it lacks the vertical dimension required to assess navigability and structural risk. This paper presents a geometric "Water Surface Elevation" approach for estimating flood depth from monocular aerial imagery. Our pipeline utilizes Mask2Former, a state‑of‑the‑art transformer‑based segmentation model, to generate precise 2D flood masks. These masks are fused with Digital Elevation Models (DEMs) to identify the water‑land boundary, calculate a global water surface elevation (Z_water), and compute per‑pixel depth based on the principle of local hydrostatic equilibrium. We evaluate this workflow using the FloodNet and CRASAR‑U‑DROIDS datasets, demonstrating how high‑performance segmentation can be leveraged to extract 3D volumetric data from 2D imagery without the latency of hydrodynamic simulations.
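
The geometric step described in the abstract above (find the water-land boundary, take one global water surface elevation, subtract the terrain) can be sketched on small grids as follows. This is an assumption-laden illustration rather than the paper's pipeline: the function name is hypothetical, the boundary uses 4-connectivity, Z_water is taken as the median DEM value along the boundary, and negative depths are clamped to zero. The abstract specifies none of these choices.

```python
import statistics


def flood_depth(mask, dem):
    """Estimate per-pixel flood depth from a binary mask and a DEM.

    mask: 2D list of 0/1 flags (1 = flooded), e.g. from a segmentation model.
    dem:  2D list of ground elevations with the same shape as mask.
    Returns (z_water, depth), where depth has the same shape as mask.
    """
    rows, cols = len(mask), len(mask[0])
    boundary = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c]:
                continue
            # A flooded pixel is on the shoreline if any 4-neighbour is dry.
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols and not mask[rr][cc]:
                    boundary.append(dem[r][c])
                    break
    if not boundary:
        raise ValueError("no water-land boundary found in mask")
    # Assumption: the shoreline elevation approximates the water surface.
    z_water = statistics.median(boundary)
    depth = [
        [max(0.0, z_water - dem[r][c]) if mask[r][c] else 0.0 for c in range(cols)]
        for r in range(rows)
    ]
    return z_water, depth
```

Using a single Z_water for the whole mask matches the "global water surface elevation" wording; a large scene with sloping water would need a per-region estimate instead.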

Abstract:
Referring expression grounding is a core problem in visual grounding and is widely used as a diagnostic of spatial grounding and reasoning in vision and language models, yet most prior work focuses on natural images. In contrast, existing chart referring expression grounding‑related benchmarks remain limited: (1) they largely adopt bounding boxes, constraining localization precision for fine chart elements; (2) they mostly assume one or two referred target instances, failing to handle multi‑instance target references; (3) the language expressions over‑rely on textual cues or data‑rank clues; and (4) they cover only a narrow range of chart types. To address these issues, we introduce a chart referring expression grounding benchmark that systematically supports multiple localization forms, multiple referred targets, diverse grounding cues, and diverse chart types. Results across representative multimodal large models reveal a significant performance gap. We further introduce a code‑driven synthesis pipeline that exploits the inherent alignment between plotting programs and rendered chart primitives to derive pixel‑accurate instance masks across chart element types and granularities. We train an instance segmentation model with the synthesized masks and integrate it into a general‑purpose multimodal grounding framework. The resulting system consistently outperforms baselines on our benchmark and generalizes well to a ChartQA‑derived real‑chart grounding benchmark.

Abstract:
Semantic segmentation is a core component of discourse analysis, yet existing models are primarily developed and evaluated on high‑resource written text, limiting their effectiveness on low‑resource spoken varieties. In particular, dialectal Arabic exhibits informal syntax, code‑switching, and weakly marked discourse structure that challenge standard segmentation approaches. In this paper, we introduce a new multi‑genre benchmark (more than 1000 samples) for semantic segmentation in conversational Arabic, focusing on dialectal discourse. The benchmark covers transcribed casual telephone conversations, code‑switched podcasts, broadcast news, and expressive dialogue from novels, and was annotated and validated by native Arabic annotators. Using this benchmark, we show that segmentation models performing well on MSA news genres degrade on dialectal transcribed speech. We further propose a segmentation model that targets local semantic coherence and robustness to discourse discontinuities, consistently outperforming strong baselines on dialectal non‑news genres. The benchmark and approach generalize to other low‑resource spoken languages.

Abstract:
Segment Anything Model 2 (SAM2) has demonstrated impressive zero‑shot capabilities on natural images but faces challenges in biomedical segmentation due to significant domain shifts and prompt dependency. To address these limitations, we propose a prompt‑free, parameter‑efficient fine‑tuning framework designed for multi‑class segmentation on variable‑sized inputs. We introduce a convolutional Positional Encoding Generator to adapt effectively to arbitrary aspect ratios and present a dual‑adapter strategy: a High‑Performance Adapter that uses deformable convolutions for precise boundary modeling, and a Lightweight Adapter that employs structural re‑parameterization to minimize inference latency. Experiments on the ISBI 2012, Kvasir‑SEG, Synapse, and ACDC datasets demonstrate that our approach significantly outperforms strong adaptation baselines. Specifically, our method improves segmentation accuracy by up to 19.66% over the vanilla SAM2, while reducing computational costs by approximately 87% compared to heavyweight medical SAM adaptations, establishing a superior trade‑off between accuracy and efficiency.

Abstract:
Multi‑task dense prediction solves complementary pixel‑level tasks in a unified model, such as semantic segmentation, depth estimation, surface normal estimation, and edge detection. Existing decoder‑side interactions use attention, prompts, routing, diffusion, Mamba, or bridge features to exchange task evidence, but most of them organize this evidence implicitly. They usually fuse task features by similarity or affinity, without explicitly modeling that evidence reliability varies across tasks and spatial locations. As a result, unreliable evidence may contaminate the shared representation and intensify negative transfer. We propose $\mathcal{B}^3$‑Net, a controlled posterior bridge learning framework for multi‑task dense prediction. Our method decomposes decoder‑side interaction into reliability estimation, posterior bridge construction, and bounded redistribution. The Precision Field Estimator estimates patch‑wise evidence precision from task‑reference alignment and local variation. The Posterior Bridge Operator builds a precision‑weighted posterior bridge through heteroscedastic evidence fusion, yielding a shared state more reliable than uniform or heuristic mixtures. The Contractive Dispatch Operator redistributes the bridge to each task branch through a bounded update, reducing uncontrolled feature injection. Experiments on NYUD‑v2, PASCAL‑Context, and Cityscapes show that $\mathcal{B}^3$‑Net achieves competitive or superior trade‑offs over representative CNN‑, Transformer‑, diffusion‑, Mamba‑, and bridge‑feature‑based methods. Backbone‑matched comparisons and extensive analyses further verify that the gains arise from controlled posterior bridge learning rather than backbone capacity or decoder scale.

Abstract:
Rheumatoid arthritis (RA) assessment from hand radiographs requires multi‑level analysis and modeling of anatomical structures and fine‑grained local pathological changes. However, existing public resources do not support such unified multi‑level analysis, often lacking full‑hand coverage, fine‑grained annotations, and consistent integration with clinical scoring systems. In particular, annotations that enable quantitative analysis of bone erosion (BE) remain scarce. RAM‑H1200 contains 1,200 hand radiographs collected from six medical centers, with multi‑level annotations including (i) whole‑hand bone structure instance segmentation, (ii) pixel‑level BE masks, (iii) SvdH‑defined joint regions of interest, and (iv) joint‑level SvdH scores for both BE and joint space narrowing (JSN). It is designed to evaluate whether models can jointly capture anatomical structure, localized erosive pathology, and clinically standardized RA severity from hand radiographs. The proposed BE masks enable, for the first time, quantitative BE analysis beyond coarse categorical grading by providing explicit spatial supervision for lesion extent and morphology. To our knowledge, RAM‑H1200 is the first public large‑scale benchmark that jointly supports whole‑hand bone structure instance segmentation, pixel‑level BE delineation, and clinically grounded joint‑level SvdH scoring for both BE and JSN. Results across benchmark tasks show that anatomical modeling is substantially more mature than quantitative BE analysis: whole‑hand bone segmentation achieves strong performance, whereas BE segmentation remains a major open challenge. By unifying anatomical structure modeling, quantitative lesion analysis, and clinically grounded SvdH scoring, RAM‑H1200 provides a single benchmark for comprehensive RA analysis on hand radiographs.

Abstract:
Recent studies have shown that large generative models can solve vision tasks they were not explicitly trained for. However, existing evidence relies on closed‑source models (Veo 3, Nano Banana Pro) or requires task‑specific instruction tuning, leaving open whether publicly available image‑editing models possess zero‑shot vision abilities out of the box. We conduct a systematic evaluation of three open‑source image‑editing models (Qwen‑Image‑Edit, FireRed‑Image‑Edit, and LongCat‑Image‑Edit) on dense visual prediction tasks without any fine‑tuning. We benchmark monocular depth estimation on NYUv2 and DIODE, surface normal estimation on NYUv2, and semantic segmentation on Cityscapes, covering both geometric and semantic scene understanding. Results show that open‑source image‑editing models exhibit non‑trivial zero‑shot visual understanding. On NYUv2 surface normals, FireRed‑Image‑Edit achieves a mean angular error of 17.69°, surpassing the fine‑tuned Marigold (20.86°) and matching the instruction‑tuned Vision Banana (17.78°) without any task‑specific training. On NYUv2 depth estimation, LongCat‑Image‑Edit obtains δ_1=0.822 with affine alignment, and Qwen‑Image‑Edit leads on DIODE Indoor (δ_1=0.868). On Cityscapes semantic segmentation, Qwen‑Image‑Edit reaches 25.7 mIoU at the 19‑class level and 49.5 mIoU at a coarser 7‑category level. By comparing three independently trained editors, we test whether zero‑shot vision ability is an emergent property of image‑editing pretraining rather than a model‑specific artifact. Code, evaluation scripts, and all results are publicly released to serve as a reproducible baseline for future work.
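
The depth numbers above are reported as δ_1 "with affine alignment". As a hedged illustration of what such a metric typically computes, the sketch below fits a least-squares scale and shift from prediction to ground truth and then measures the fraction of pixels whose ratio error is below 1.25. The function names are hypothetical, and whether the paper aligns in depth space or in disparity space is not stated in the abstract; depth space is assumed here.

```python
def affine_align(pred, gt):
    """Least-squares scale s and shift t minimizing sum((s*pred + t - gt)**2)."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_g = sum(gt) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    if var_p == 0:
        raise ValueError("constant prediction cannot be affinely aligned")
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gt))
    s = cov / var_p
    t = mean_g - s * mean_p
    return [s * p + t for p in pred]


def delta1(pred, gt, threshold=1.25):
    """Fraction of pixels with max(pred/gt, gt/pred) < threshold.

    Non-positive predictions are counted as failures.
    """
    hits = 0
    for p, g in zip(pred, gt):
        if p > 0 and g > 0 and max(p / g, g / p) < threshold:
            hits += 1
    return hits / len(gt)


if __name__ == "__main__":
    pred = [1.0, 2.0, 3.0]
    gt = [3.0, 5.0, 7.0]
    print(delta1(pred, gt))                    # raw prediction
    print(delta1(affine_align(pred, gt), gt))  # after scale-and-shift alignment
```

The alignment step matters for models that output relative rather than metric depth, because a correct depth ordering with the wrong scale would otherwise score near zero.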

Abstract:
Vision Transformers (ViTs) achieve state‑of‑the‑art segmentation accuracy but require large training datasets because each layer has unique parameters that must be learned independently. We present RD‑ViT, a Recurrent‑Depth Vision Transformer that adapts the Recurrent‑Depth Transformer (RDT) architecture to dense prediction tasks, supporting both 2D and 3D inputs. RD‑ViT replaces the deep stack of unique transformer blocks with a single shared block looped T times, augmented with LTI‑stable state injection for guaranteed convergence, Adaptive Computation Time (ACT) for spatial compute allocation, depth‑wise LoRA adaptation, and optional Mixture‑of‑Experts (MoE) feed‑forward networks for category‑specific specialization. We evaluate on the ACDC cardiac MRI segmentation benchmark in both 2D slice‑level and 3D volumetric settings with exclusively real experiments executed in Google Colab. In 2D, RD‑ViT outperforms standard ViT at 10% training data (Dice 0.774 vs 0.762) and at full data (0.882 vs 0.872). In 3D, RD‑ViT with MoE achieves Dice 0.812 with 3.0M parameters, reaching 99.4% of standard ViT performance (0.817) at 53% of the parameter count. MoE expert utilization analysis reveals that different experts spontaneously specialize for different cardiac structures (RV, MYO, LV) without explicit routing supervision. ACT halting maps show higher compute allocation at cardiac boundaries, and the mean ponder time decreases from 2.6 to 1.4 iterations during training, demonstrating learned computational efficiency. Depth extrapolation enables inference with more loops than training without degradation. All code, notebooks, and results are publicly released.

Abstract:
Accurate school detection is essential for supporting education initiatives, including infrastructure planning and expanding internet connectivity to underserved areas. However, many regions around the world face challenges due to outdated, incomplete, or unavailable official records. Manual mapping efforts, while valuable, are labor‑intensive and lack scalability across large geographic areas. To address this, we propose a weakly supervised framework for school detection from aerial imagery that minimizes the need for human annotations while supporting global mapping efforts. Our method is specifically designed for low‑data regimes, where manual annotations are extremely scarce. We introduce an automatic labeling pipeline that leverages sparse location points and semantic segmentation to generate infrastructure masks, from which we derive bounding boxes. In a first training stage, we train our detectors on these automatically labeled images to learn a representation of what schools look like; in a second stage, we fine‑tune the resulting models on a small, clean set of manually labeled images. This two‑stage training pipeline enables strong, large‑scale detection of school infrastructure in low‑data settings with minimal supervision. Our results demonstrate strong object detection performance, particularly in the low‑data regime, where the models achieve promising results using only 50 manually labeled images, significantly reducing the need for costly annotations. This framework supports education and connectivity initiatives worldwide by providing an efficient and extensible approach to mapping schools from space. All models, training code, and auto‑labeled data will be publicly released to foster future research and real‑world impact.

Abstract:
Visual affordances identify regions in an image with potential interactions, offering a novel paradigm for scene understanding. Recognizing affordances allows autonomous robots to act more naturally, could enhance human‑robot interactions, enrich augmented reality systems, and benefit prosthetic vision devices. Accurate and localized prediction of affordance regions, rather than general saliency maps, is crucial for these applications. We present a model for instance segmentation of affordances that adopts sample‑based and ensemble approaches for uncertainty estimation. We extend an attention‑based architecture for our novel task, showing with detailed ablation experiments the effects of each component. By comparing the distribution of these different detections, we extract pixel‑wise epistemic and aleatoric variances at both the semantic and spatial levels. In addition, we propose a novel measure called Probability‑based Mask Quality, which enables a comprehensive analysis of semantic and spatial variations in a probabilistic instance segmentation model. Our results show that the global consensus of multiple sub‑networks of Bayesian models improves on deterministic networks thanks to better mask refinement and generalization. This, combined with the more powerful features extracted by attention‑based mechanisms, yields an improvement of +7.4 p.p. in the F_β^w score on the challenging IIT‑Aff dataset. Bayesian models are also better calibrated, producing less overconfident probabilities and better uncertainty estimates. Qualitative results show that aleatoric variance appears along object contours, while epistemic variance is observed in visually challenging pixels, adding interpretability to the neural network.

Abstract:
Continual semantic segmentation requires models to adapt to new domains or modalities without sacrificing performance on previously learned tasks. Expert‑based learning, in which task‑specific modules specialize in different domains, has proven effective in mitigating forgetting. These methods rely on either dynamic expansion, which suffers from scalability issues, or parameter isolation, which constrains the ability to learn new tasks. We introduce Mixture of Incremental LoRA Experts (MILE), a modular and parameter‑efficient framework for continual segmentation across both domains and modalities. MILE leverages Low‑Rank Adaptation (LoRA) to instantiate lightweight experts for each new task while keeping the pretrained base network frozen. Each expert is trained exclusively on its task data, thus avoiding overwriting previously learned information. A prototype‑guided gating mechanism dynamically selects the most appropriate expert at inference. MILE achieves the benefits of expert‑based learning while overcoming its scalability limitations. It requires only a marginal parameter increase per task, and tens of LoRA adapters are needed before their combined size matches that of a single full model, making it highly efficient in both training and storage. Across domain‑ and modality‑incremental benchmarks, MILE achieves strong performance while ensuring better stability, plasticity, and scalability.
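
One plausible reading of the "prototype‑guided gating" mentioned above is nearest-prototype routing: each expert stores a prototype feature for its task, and inference picks the expert whose prototype is most similar to the input feature. The sketch below illustrates that idea only. The use of cosine similarity, the function names, and the example task names are assumptions, not details given in the abstract.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


def select_expert(feature, prototypes):
    """Return the name of the expert whose prototype best matches `feature`.

    prototypes: dict mapping expert name -> prototype vector, one per task.
    """
    return max(prototypes, key=lambda name: cosine(feature, prototypes[name]))


if __name__ == "__main__":
    # Hypothetical two-task setup with 2-D features for readability.
    prototypes = {"rgb": [1.0, 0.0], "thermal": [0.0, 1.0]}
    print(select_expert([0.9, 0.2], prototypes))
```

Because routing only compares against stored prototypes, adding a task adds one prototype and one adapter without touching the existing experts.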

Abstract:
Semantic segmentation provides pixel‑level scene understanding essential for autonomous driving and fine‑grained perception tasks. However, training segmentation models requires costly, labor‑intensive annotations on real‑world datasets. Unsupervised Domain Adaptation (UDA) addresses this by training models on labeled synthetic data and adapting them to unlabeled real images. While conceptually simple, adaptation is challenging due to the domain gap, i.e., differences in visual appearance and scene structure between synthetic and real data. Prior approaches bridge this gap through pixel‑level mixing or feature‑level contrastive learning. Yet, these techniques suffer from two major limitations: (1) reliance on high‑confidence pseudo‑labels restricts learning to a subset of the target domain, and (2) prototype‑based contrastive methods initialize class prototypes from source‑trained models, yielding biased and unstable anchors during adaptation. To address these issues, we propose a dual‑foundation UDA framework that leverages two complementary foundation models. First, we employ the Segment Anything Model (SAM) with superpixel‑guided prompting to enable learning from a broader range of target pixels beyond high‑confidence predictions. Second, we incorporate DINOv3 to construct stable, domain‑invariant class prototypes through its robust representation learning. Our method achieves consistent improvements of +1.3% and +1.4% mIoU over strong UDA baselines on GTA‑to‑Cityscapes and SYNTHIA‑to‑Cityscapes, respectively.

Abstract:
We present FoR‑Net, an efficient semantic segmentation framework that focuses on identifying and enhancing hard regions. Instead of relying on heavy global modeling, FoR‑Net adopts an efficient strategy that selectively emphasizes informative regions through a learned importance map and a Top‑K activation mechanism. Specifically, a selector module predicts region‑wise importance, enabling the model to focus on challenging areas such as thin structures and object boundaries. Multi‑scale reasoning is achieved using convolutional branches with different receptive fields, allowing diverse spatial context aggregation. We evaluate FoR‑Net on the Cityscapes benchmark under limited computational resources. Despite its efficient design and standard training configuration, FoR‑Net achieves competitive performance and exhibits improved attention to difficult regions. These results suggest that selective region‑focused reasoning can serve as a practical and efficient alternative for semantic segmentation. This work explores region‑focused reasoning under resource‑constrained settings and provides insights for developing efficient and region‑aware segmentation models.
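
The selection step described above (a predicted importance map followed by Top‑K activation) can be sketched on a flattened list of regions. This is an illustrative sketch, not FoR‑Net's code: the function names are hypothetical, ties are broken by region index, and the `refine` callable stands in for whatever extra processing the selected regions receive.

```python
def top_k_mask(importance, k):
    """Binary mask marking the k regions with the highest importance scores."""
    if not 0 <= k <= len(importance):
        raise ValueError("k must be between 0 and the number of regions")
    # Sort by descending score; equal scores keep the lower index first.
    order = sorted(range(len(importance)), key=lambda i: (-importance[i], i))
    selected = set(order[:k])
    return [1 if i in selected else 0 for i in range(len(importance))]


def refine_selected(features, importance, k, refine):
    """Apply `refine` only to the Top-K regions and pass the rest through."""
    mask = top_k_mask(importance, k)
    return [refine(f) if m else f for f, m in zip(features, mask)]


if __name__ == "__main__":
    scores = [0.1, 0.9, 0.4, 0.7]
    print(top_k_mask(scores, 2))
    print(refine_selected([1.0, 2.0, 3.0, 4.0], scores, 2, lambda f: f * 10))
```

Restricting the extra computation to k regions is what keeps the cost bounded regardless of image content.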

Abstract:
Real‑time crack segmentation is vital for structural health monitoring but is plagued by aleatoric uncertainties arising from varying lighting, blur, and texture ambiguity. Current uncertainty‑aware approaches typically treat uncertainty estimation as a passive endpoint for post‑hoc analysis, failing to close the loop by feeding this information back to refine feature representations. We contend that independent pixel‑wise heteroscedastic modeling is uniquely suited for crack segmentation, as cracks are defined by fine‑grained local gradients rather than the global semantic coherence relied upon in general object segmentation. However, this approach suffers from a structural optimization pathology: high predicted variance attenuates loss gradients, effectively causing the model to ignore difficult samples and under‑fit complex boundaries. To address these challenges, we propose UnGAP, a novel framework that establishes a closed‑loop mechanism between uncertainty estimation and feature learning. Central to our approach is the Uncertainty‑Prompted Feature Modulator (UPFM), which treats aleatoric uncertainty as an active visual prompt rather than a mere output. UPFM dynamically calibrates feature distributions through pixel‑wise affine transformations. Crucially, this mechanism mitigates the heteroscedastic pathology by transforming high variance, which would otherwise indicate gradient suppression, into a constructive signal for stronger feature rectification in ambiguous regions. Additionally, a boundary‑aware detection head is introduced to further constrain prediction precision. Extensive experiments demonstrate that UnGAP balances superior segmentation accuracy with real‑time inference speed, effectively validating the benefit of transforming uncertainty from a passive metric into an active calibration tool.

Abstract:
Infrared‑visible image fusion aims to create an information‑rich fused image by integrating the complementary thermal saliency from infrared sensing and fine textures from visible imaging. Such accurate fusion is essential for real‑world perception applications in complex scenes, including nighttime autonomous driving, search and rescue, and surveillance, and can further benefit downstream tasks such as semantic segmentation. However, most existing fusion methods rely upon static trained weights that cannot adapt to scene‑specific content at inference time, and often suffer from a granularity mismatch when coarse auxiliary semantics are injected, which makes it difficult to simultaneously highlight targets and preserve details. In this work, we propose EAPFusion to address these issues by using self‑evolving intrinsic priors instead of relying on external auxiliary models. Concretely, EAPFusion maintains a compact set of intrinsic priors and progressively updates them across scales. These evolved priors are utilized to dynamically generate convolutional kernels, shifting the paradigm from fixed, pre‑trained filters to instance‑adaptive parameters via prior‑conditioned dynamic convolution. Furthermore, we design a channel‑level fusion module that shuffles and interleaves infrared and visible channels, applying local channel mixing to boost cross‑modal complementarity. Experiments on different datasets, including cross‑dataset evaluation and semantic segmentation, show that the proposed method achieves state‑of‑the‑art quantitative and qualitative fusion results, and consistently boosts downstream performance. Code is coming soon.

Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated strong image‑level visual understanding and reasoning, yet their pixel‑level perception across both images and videos remains limited. Foundation segmentation models such as the SAM series produce high‑quality masks, but they rely on low‑level visual prompts and cannot natively interpret complex conversational instructions. Existing segmentation MLLMs narrow this gap, but are usually specialized for either images or videos and rarely support both textual and visual prompts in one interface. We introduce X2SAM, a unified segmentation MLLM that extends any‑segmentation capabilities from images to videos. Given conversational instructions and visual prompts, X2SAM couples an LLM with a Mask Memory module that stores guided vision features for temporally consistent video mask generation. The same formulation supports generic, open‑vocabulary, referring, reasoning, grounded conversation generation, interactive, and visual grounded segmentation across image and video inputs. We further introduce the Video Visual Grounded (V‑VGD) segmentation benchmark, which evaluates whether a model can segment object tracks in videos from interactive visual prompts. With a unified joint training strategy over heterogeneous image and video datasets, X2SAM delivers strong video segmentation performance, remains competitive on image segmentation benchmarks, and preserves general image and video chat ability.

Abstract:
Deep Unfolding Network‑based methods have emerged as effective solutions for multi‑source image fusion by combining model‑driven iterative optimization with data‑driven deep learning. However, most existing deep unfolding image fusion methods are derived from alternating minimization, which updates the features of different modalities separately. This design introduces considerable computational and memory overhead, limiting deployment on resource‑constrained edge devices. To address this issue, we propose CDNet, a lightweight Combined Dictionary Unfolding Network for multi‑source image fusion. Rather than introducing a new sparse coding prior or empirically compressing an existing fusion network, CDNet translates the unique‑common decomposition prior of coupled dictionary learning into a structurally constrained joint unfolding architecture. The resulting CDBlock follows a block‑sparse interaction topology and performs a model‑derived joint update of common and modality‑specific representations, thereby streamlining feature learning and improving efficiency. In addition, we design a compact High‑ and Low‑frequency Image Fidelity loss for unsupervised training without ground‑truth images. We evaluate CDNet on four tasks, including multi‑exposure image fusion, infrared and visible image fusion, medical image fusion, and infrared and visible image fusion for semantic segmentation. Experimental results show that CDNet achieves competitive or superior fusion performance with high efficiency. For infrared and visible image fusion, CDNet outperforms competing methods on four of six metrics on the TNO dataset and five of six metrics on the RoadScene dataset. In particular, it surpasses the second‑best method by 1.23 dB and 1.59 dB in PSNR on TNO and RoadScene, respectively.

Abstract:
We present MiniVLA‑Nav v1, a simulation dataset for Language‑Conditioned Object Approach (LCOA) navigation: given a short natural‑language instruction, an NVIDIA Nova Carter differential‑drive robot must navigate to the named object and stop within 1 m across four photorealistic Isaac Sim environments (Office, Hospital, Full Warehouse, and Warehouse with Multiple Shelves). Each of the 1,174 episodes pairs an instruction with synchronized 640x640 RGB images, metric depth maps (float32, metres), and instance segmentation masks, together with continuous (v,omega) and 7x7 tokenized expert action labels recorded at 60 Hz from a vision‑based proportional controller. Trajectory diversity is ensured through three spawn‑distance tiers (near: 1.5‑3.5 m, mid: 3.5‑7.0 m, far: global curated points; Pearson r=0.94 between spawn distance and trajectory length), 12 object categories, 18 training templates, and 12 paraphrase‑OOD templates. Five evaluation splits support in‑distribution accuracy, template‑paraphrase robustness, and OOD object‑category benchmarking. The dataset is publicly available at https://huggingface.co/datasets/alibustami/miniVLA‑Nav

Abstract:
Developing robust techniques for super‑resolution of satellite imagery involves navigating commonly observed trade‑offs between spectral fidelity and perceptual quality. In this work, we introduce a flow matching model for 4x super‑resolution of 10‑m Sentinel‑2 visible and near‑infrared bands over the conterminous United States (CONUS) using a dataset of 120,851 10‑m Sentinel‑2 and 2.5‑m resampled NAIP imagery pairs acquired on the same day. Our results showed that the flow matching model outperformed diffusion and Real‑ESRGAN models in pixel‑wise accuracy in a single sampling step using the Euler method. When evaluated with a second‑order Midpoint solver, our model generated perceptually realistic super‑resolved imagery in only 20 sampling steps, effectively navigating the perception‑distortion trade‑off at inference time without retraining. We used this model to produce a super‑resolved 2.5‑m 4‑band CONUS imagery product derived from 2025 10‑m Sentinel‑2 annual composites, consisting of over 1.58 trillion pixels. We further evaluated the use of super‑resolved data on a land cover classification task using semantic segmentation models. Finally, we generated a yearly 2.5‑m land cover product for the Chesapeake Bay watershed for 2020‑2025. An accuracy assessment against 25,000 ground truth points revealed an overall accuracy of 89.11% for the annual land cover product. We conclude that flow matching is an effective generative modeling approach for super‑resolution of Sentinel‑2 imagery compared to diffusion and Generative Adversarial Network‑based methods, and has strong implications for expanding access to high‑resolution imagery for geospatial applications that demand fine spatial detail.
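
The two samplers named above, first-order Euler and second-order Midpoint, can be sketched as generic ODE integrators over a velocity field on t in [0, 1]. This is a minimal illustration: `toy_velocity` is a hypothetical stand-in for the learned flow matching network, and the scalar state replaces an image tensor.

```python
import math
from typing import Callable

Velocity = Callable[[float, float], float]

def euler_sample(v: Velocity, x0: float, steps: int) -> float:
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with first-order Euler steps."""
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        x = x + dt * v(x, i * dt)
    return x

def midpoint_sample(v: Velocity, x0: float, steps: int) -> float:
    """Integrate the same ODE with the second-order midpoint rule."""
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        t = i * dt
        x_mid = x + 0.5 * dt * v(x, t)            # half step to the midpoint
        x = x + dt * v(x_mid, t + 0.5 * dt)       # full step using the midpoint slope
    return x

def toy_velocity(x: float, t: float) -> float:
    """Hypothetical velocity field with known solution x(1) = x0 * exp(-1)."""
    return -x

exact = math.exp(-1.0)
euler_err = abs(euler_sample(toy_velocity, 1.0, 20) - exact)
midpoint_err = abs(midpoint_sample(toy_velocity, 1.0, 20) - exact)
```

Because the step count is chosen at inference time, the same trained velocity model can be sampled in one Euler step or in several midpoint steps without retraining, at a cost of two velocity evaluations per midpoint step.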

Abstract:
Event cameras provide several unique advantages over standard frame‑based sensors, including high temporal resolution, low latency, and robustness to extreme lighting. However, existing learning‑based approaches for event processing are typically confined to narrow, task‑specific silos and lack the ability to generalize across modalities. We address this gap with REALM, a cross‑modal framework that learns an RGB and Event Aligned Latent Manifold by projecting event representations into the pretrained latent space of RGB foundation models. Instead of task‑specific training, we leverage low‑rank adaptation (LoRA) to bridge the modality gap, effectively unlocking the geometric and semantic priors of frozen RGB backbones for asynchronous event streams. We demonstrate that REALM effectively maps events into the ViT‑based foundation latent space. Our method allows us to perform downstream tasks like depth estimation and semantic segmentation by simply transferring linear heads trained on the RGB teacher. Most significantly, REALM enables the direct, zero‑shot application of complex, frozen image‑trained decoders, such as MASt3R, to raw event data. We demonstrate state‑of‑the‑art performance in wide‑baseline feature matching, significantly outperforming specialized architectures. Code and models will be made available upon acceptance.

Abstract:
Reconstructing 3D scenes from sparse, unposed images remains challenging under real‑world conditions with varying illumination and transient occlusions. Existing methods rely on scene‑specific optimization using appearance embeddings or dynamic masks, which requires extensive per‑scene training and fails under sparse views. Moreover, evaluations on limited scenes raise questions about generalization. We present GenWildSplat, a feed‑forward framework for sparse‑view outdoor reconstruction that requires no per‑scene optimization. Given unposed internet images, GenWildSplat predicts depth, camera parameters, and 3D Gaussians in a canonical space using learned geometric priors. An appearance adapter modulates appearance for target lighting conditions, while semantic segmentation handles transient objects. Through curriculum learning on synthetic and real data, GenWildSplat generalizes across diverse illumination and occlusion patterns. Evaluations on the PhotoTourism and MegaScenes benchmarks demonstrate state‑of‑the‑art feed‑forward rendering quality, achieving real‑time inference without test‑time optimization.

Abstract:
Semantic segmentation and change detection are two fundamental challenges in remote sensing, requiring models to capture either spatial semantics or temporal differences from satellite imagery. Existing deep learning models often struggle with temporal inconsistencies or in capturing fine‑grained spatial structures, require extensive pretraining, and offer limited interpretability ‑ especially in real‑world remote sensing scenarios. Recent advances in diffusion models show that Gaussian noise can be systematically leveraged to learn expressive data representations through denoising. Motivated by this, we investigate whether the noise process in diffusion models can be effectively utilized for discriminative tasks. We propose Noise2Map, a unified diffusion‑based framework that repurposes the denoising process for fast, end‑to‑end discriminative learning. Unlike prior work that uses diffusion only for generation or feature extraction, Noise2Map directly predicts semantic or change maps using task‑specific noise schedules and timestep conditioning, avoiding the costly sampling procedures of traditional diffusion models. The model is pretrained via self‑supervised denoising and fine‑tuned with supervision, enabling both interpretability and robustness. Our architecture supports both tasks (SS and CD) through a shared backbone and task‑specific noise schedulers. Extensive evaluations on the SpaceNet7, WHU, and xView2 buildings damaged by wildfires datasets demonstrate that Noise2Map ranks on average 1st among seven models on semantic segmentation and 1st on change detection by a cross‑dataset rank metric (average F1 primary, IoU tie‑break). Ablation studies highlight the robustness of our model against different training noise schedulers and timestep control in the diffusion process, as well as the ability of the model to perform multi‑task learning.

Abstract:
In the segmentation of remotely sensed images, deep learning models are typically pre‑trained using large image databases like ImageNet before being fine‑tuned on domain‑specific datasets. However, the performance of these fine‑tuned models is often hindered by the large domain gaps (i.e., differences in scenes and modalities) between ImageNet's images and the remotely sensed images being processed. Therefore, many researchers have undertaken efforts to establish large‑scale domain‑specific image datasets for pre‑training, aiming to enhance model performance. However, establishing such datasets is often challenging, requiring significant effort, and these datasets often exhibit limited generalizability to other application scenarios. To address these issues, this study introduces a novel yet simple pre‑training strategy designed to guide a model away from learning domain‑specific features in a pre‑training dataset during pre‑training, thereby improving the generalisation ability of the pre‑trained model. To evaluate the strategy's effectiveness, deep learning models are pre‑trained on ImageNet and subsequently fine‑tuned on four semantic segmentation datasets with diverse scenes and modalities, including iSAID, MFNet, PST900 and Potsdam. Experimental results show that the proposed pre‑training strategy led to state‑of‑the‑art accuracies on all four datasets, namely 67.4% mIoU for iSAID, 56.9% mIoU for MFNet, 84.22% mIoU for PST900, and 91.88% mF1 for Potsdam. This research lays the groundwork for developing a unified foundation model applicable to both computer vision and remote sensing applications.

Abstract:
Semantic segmentation in remote sensing is commonly addressed using classical deep learning architectures such as U‑Net, which require a large number of parameters to model complex spatial relationships. Quantum machine learning (QML) provides an alternative representation paradigm by mapping classical features into quantum states, but its direct application to high‑dimensional images remains challenging under near‑term quantum hardware constraints. In this work, we propose HQ‑UNet, a hybrid quantum‑classical U‑Net architecture that integrates a compact parameterized quantum circuit at the bottleneck of a classical U‑Net. The proposed design uses a non‑pooling quantum convolutional module to enrich highly compressed encoder features before decoding, while keeping the quantum component shallow and parameter‑efficient. Experiments on the LandCover.ai dataset show that HQ‑UNet achieves a mean IoU of 0.8050 and an overall accuracy of 94.76%, outperforming the classical U‑Net baseline. These results suggest that compact quantum bottlenecks can enhance feature representation for remote sensing image segmentation under near‑term quantum constraints. This highlights the potential of hybrid quantum‑classical designs as a promising direction for parameter‑efficient dense prediction in Earth observation.

Abstract:
Foundation‑model pipelines for individual‑level livestock monitoring ‑‑ combining open‑vocabulary detection, promptable video segmentation, and self‑supervised visual embeddings ‑‑ have raised the accuracy ceiling of precision livestock farming (PLF), but their GPU memory budgets exceed the envelope of commodity edge accelerators. To close this gap, the 446M‑parameter Perception Encoder (PE‑ViT‑L+) backbone of SAM 3 is distilled into a 40.66M‑parameter multi‑scale student through three mechanisms: a Feature Pyramid Network student encoder built on TinyViT‑21M‑512, a four‑term direction‑then‑scale distillation loss, and backbone‑substitution inference with sliding‑window session pruning that bounds streaming GPU memory growth. The DINOv3 family includes a pre‑distilled ViT‑S/16 variant (21.6M parameters) released alongside a 6716M‑parameter ViT‑7B teacher; the ViT‑S (21M) variant is adopted as the per‑individual embedder. On the Edinburgh Pig dataset, the compressed pipeline reaches 92.29% MOTA and 96.15% IDF1 against the SAM 3 teacher (1.68‑ and 0.84‑percentage‑point losses), achieves a 7.77‑fold reduction in system‑level parameters and a 3.01‑fold reduction in peak VRAM (19.52GB ‑> 6.49GB), and reaches 97.34% top‑1 accuracy with 91.67% macro‑F1 on nine‑class pig behaviour classification. The pipeline fits inside an NVIDIA Jetson Orin NX 16GB envelope with 4.9GB of headroom, supporting a proposed ‑‑ but not yet empirically validated ‑‑ on‑device embedding‑pool re‑identification mechanism whose per‑individual footprint of approximately 94MB per animal per year produces a longitudinal visual record amenable to retrospective association with disease, lameness, reproductive, and growth outcome labels.

Abstract:
One of the most exciting applications of vision models involves pixel‑level reasoning. Despite the abundance of vision foundation models, we still lack representations that effectively embed spatio‑temporal properties of visual scenes at the pixel level. Existing frameworks either train on image‑based pretext tasks, which do not account for dynamic elements, or on video sequences for action‑level reasoning, which does not scale to dense pixel‑level prediction. We present LILA, a framework that learns pixel‑accurate feature descriptors from videos. The core element of our training framework is linear in‑context learning. LILA leverages spatio‑temporal cue maps ‑‑ depth and motion ‑‑ estimated with off‑the‑shelf networks. Despite the noisy nature of those cues, LILA trains effectively on uncurated video datasets, embedding semantic and geometric properties in a temporally consistent manner. We demonstrate compelling empirical benefits of the learned representation across a diverse suite of vision tasks: video object segmentation, surface normal estimation and semantic segmentation.

Abstract:
Hyperspectral imaging (HSI) semantic segmentation typically relies on in‑domain training, but limited data availability often restricts model performance in real‑world applications. Current approaches to leverage foundation models in proximal sensing use cross‑modality techniques, bridging RGB and HSI to exploit vision foundation models. However, these methods either discard spectral information or introduce architectural complexity. We propose cross‑domain transfer as an alternative, reusing HSI foundation models ‑ originally trained in remote sensing ‑ for proximal sensing applications. By eliminating the need to bridge modality gaps, our approach preserves spectral information while maintaining a simple architecture. Using the HS3‑Bench benchmark, we systematically evaluate and compare conventional in‑domain, in‑modality training, cross‑modality transfer and cross‑domain transfer strategies. Our results demonstrate that cross‑domain transfer achieves large performance improvements over in‑domain, in‑modality training, reduces the performance gap to cross‑modality approaches and maintains strong performance in limited data settings. Thus, this work advances more effective HSI semantic segmentation in diverse applications.

Abstract:
Open‑vocabulary semantic segmentation (OVSS) in remote sensing images is a promising task that employs textual descriptions for identifying undefined land cover categories. Despite notable advances, existing methods typically employ a static inference paradigm, overlooking the distinct distribution of each scene, resulting in semantic ambiguity in diverse land covers and incomplete foreground activation. Motivated by this, we propose Seeking Consensus, termed SeeCo, a plug‑and‑play framework to boost the performance of training‑free OVSS models in remote sensing images. SeeCo recalibrates arbitrary OVSS models on‑the‑fly by seeking dual consensus: geometric consensus learning (GCL) through multi‑view consistent observations and semantic consensus learning (SCL) via textual description adaptive calibration, which together assist collaborative recalibration of visual and textual semantics. The two forms of consensus are injected via an online consensus injector (OCI), effectively alleviating under‑activation and semantic bias. SeeCo requires no specific training process, yet recalibrates semantic‑geometric alignment for each unique scene during inference. Extensive experiments on eight remote sensing OVSS benchmarks show consistent gains, demonstrating its effectiveness and universality.

Authors: Chang Liu, Henghui Ding, Nikhila Ravi, Yunchao Wei, Shuting He, Song Bai, Philip Torr, Leilei Cao, Jinrong Zhang, Deshui Miao, Xusheng He, Dengxian Gong, Zhiyu Wang, Mingqi Gao, Jihwan Hong, Canyang Wu, Weili Guan, Jianlong Wu, Liqiang Nie, Xingsen Huang, Yameng Gu, Xiaogang Yu, Xin Li, Ming-Hsuan Yang, Sijie Li, Jungong Han, Quanzhu Niu, Shihao Chen, Yuanzheng Wu, Yikang Zhou, Tao Zhang, Haobo Yuan, Lu Qi, Shunping Ji, Chao Yang, Chao Tian, Guoqing Zhu, Kai Yang, Zhifan Mo, Haijun Zhang, Xudong Kang, Shutao Li, Jaeyoung Do

Abstract:
This report summarizes the objectives, datasets, and top‑performing methodologies of the 2026 Pixel‑level Video Understanding in the Wild (PVUW) Challenge, hosted at CVPR 2026, which evaluates state‑of‑the‑art models under highly unconstrained conditions. To provide a comprehensive assessment, the 2026 edition features three specialized tracks: the MOSE track for tracking objects within densely cluttered and severely occluded scenarios; the MeViS‑Text track for localizing targets via motion‑focused linguistic expressions; and the newly inaugurated MeViS‑Audio track, which pioneers acoustic‑driven object segmentation. By introducing previously unreleased challenging data and analyzing the cutting‑edge, multimodal solutions submitted by participants, this report highlights the community's latest technical advancements and charts promising future directions for robust video scene comprehension.

Abstract:
Recent knowledge distillation (KD) methods for semantic segmentation introduce increasingly complex hand‑crafted objectives, yet are typically evaluated under fixed iteration schedules. These objectives substantially increase per‑iteration cost, meaning equal iteration counts do not correspond to equal training budgets. It is therefore unclear whether reported gains reflect stronger distillation signals or simply greater compute. We show that iteration‑based comparisons are misleading: when wall‑clock compute is matched, canonical logit‑ and feature‑based KD outperform recent segmentation‑specific methods. Under extended training, feature‑based distillation achieves state‑of‑the‑art ResNet‑18 performance on Cityscapes and ADE20K. A PSPNet ResNet‑18 student closely approaches its ResNet‑101 teacher despite using only one quarter of the parameters, reaching 99% of the teacher's mIoU on Cityscapes (79.0 vs 79.8) and 92% on ADE20K. Our results challenge the prevailing assumption that KD for segmentation requires task‑specific mechanisms and suggest that scaling, rather than complex hand‑crafted objectives, should guide future method design.
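
For reference, the canonical logit-based distillation objective mentioned above can be sketched as a per-pixel KL divergence between softened teacher and student class distributions. This is a minimal NumPy illustration under stated assumptions: the function names, the `temperature` parameter, and the (C, H, W) layout are hypothetical and are not taken from the paper's implementation.

```python
import numpy as np

def softmax(logits: np.ndarray, axis: int = 0) -> np.ndarray:
    """Numerically stable softmax along the class axis."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def logit_kd_loss(student_logits: np.ndarray, teacher_logits: np.ndarray,
                  temperature: float = 1.0) -> float:
    """Mean per-pixel KL(teacher || student) for logits of shape (C, H, W)."""
    p_t = softmax(teacher_logits / temperature, axis=0)
    p_s = softmax(student_logits / temperature, axis=0)
    eps = 1e-12  # avoids log(0) for saturated predictions
    kl = (p_t * (np.log(p_t + eps) - np.log(p_s + eps))).sum(axis=0)
    # The T^2 factor is the usual rescaling that keeps gradient magnitudes
    # comparable across temperatures.
    return float(kl.mean() * temperature ** 2)
```

An objective of this kind adds only one extra softmax and a reduction per iteration, which is why its per-iteration cost is low compared with the hand-crafted objectives the abstract discusses.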

Abstract:
Open‑vocabulary semantic segmentation requires assigning pixel‑level semantic labels while supporting an open and unrestricted set of categories. Training‑free CLIP‑based approaches preserve strong zero‑shot generalization but typically rely on a single inference mechanism, limiting their ability to jointly address unreliable local tokens and insufficient spatial coherence. We propose DouC, a training‑free dual‑branch CLIP framework that decomposes dense prediction into two complementary components. OG‑CLIP improves patch‑level reliability via lightweight, inference‑time token gating, while FADE‑CLIP injects external structural priors through proxy attention guided by frozen vision foundation models. The two branches are fused at the logit level, enabling local token reliability and structure‑aware patch interactions to jointly influence final predictions, with optional instance‑aware correction applied as post‑processing. DouC introduces no additional learnable parameters, requires no retraining, and preserves CLIP's zero‑shot generalization. Extensive experiments across eight benchmarks and multiple CLIP backbones demonstrate that DouC consistently outperforms prior training‑free methods and scales favorably with model capacity.

Abstract:
Monocular RGB cameras mounted on drones are widely used for wildlife monitoring, yet most analytical pipelines remain confined to two‑dimensional image space, leaving geometric information in video underexploited. We present WildLIFT, a computational framework that integrates three‑dimensional scene geometry from monocular drone video with open‑vocabulary 2D instance segmentation to enable species‑agnostic 3D detection and tracking. Oriented 3D bounding box labels with semantic face information enable quantitative assessment of viewpoint coverage and inter‑animal occlusion, producing structured metadata for downstream ecological analyses. We validate the framework on 2,581 manually curated frames comprising over 6,700 3D detections across four large mammal species. WildLIFT maintains high identity consistency in multi‑animal scenes and substantially reduces manual 3D annotation effort through keyframe‑based refinement. By transforming standard drone footage into structured 3D and viewpoint‑aware representations, WildLIFT extends the analytical utility of aerial wildlife datasets for behavioural research and population monitoring.

Abstract:
Diffusion models are primarily trained for image synthesis, yet their denoising trajectories encode rich, spatially aligned visual priors. In this paper, we demonstrate that these priors can be utilized for text‑conditioned semantic and open‑vocabulary segmentation, and this approach can be generalized to various downstream tasks to make a general‑purpose diffusion segmentation framework. Concretely, we introduce DiGSeg (Diffusion Models as a Generalist Segmentation Learner), which repurposes a pretrained diffusion model into a unified segmentation framework. Our approach encodes the input image and ground‑truth mask into the latent space and concatenates them as conditioning signals for the diffusion U‑Net. A parallel CLIP‑aligned text pathway injects language features across multiple scales, enabling the model to align textual queries with evolving visual representations. This design transforms an off‑the‑shelf diffusion backbone into a universal interface that produces structured segmentation masks conditioned on both appearance and arbitrary text prompts. Extensive experiments demonstrate state‑of‑the‑art performance on standard semantic segmentation benchmarks, as well as strong open‑vocabulary generalization and cross‑domain transfer to medical, remote sensing, and agricultural scenarios, without domain‑specific architectural customization. These results indicate that modern diffusion backbones can serve as generalist segmentation learners rather than pure generators, narrowing the gap between visual generation and visual understanding.

Abstract:
Automatic generation of Building Information Models (BIM) from building scans is a key challenge in architecture and construction. We present a modular pipeline for generating IFC‑compliant BIM from 3D point clouds. The hybrid approach combines learning‑based semantic segmentation with topology‑aware geometric reconstruction to model structural elements accurately. We propose vIoU, adapting voxel‑based overlap evaluation to Scan‑to‑BIM by enabling holistic, instance‑matching‑free comparison of reconstructed and ground‑truth models. We release the German Hospital dataset (DeKH), including high‑resolution point clouds, ground truth BIMs, and semantic annotations. Experiments on DeKH and CV4AEC datasets show significant improvements over a RANSAC‑based baseline, demonstrating robustness and scalability.

Abstract:
Instance‑sensitive losses for semantic segmentation such as blob loss and CC loss were designed to address instance imbalance, ensuring small lesions generate the same gradient as large ones, but operate only on single‑class segmentation. In multi‑class settings, class imbalance poses an additional problem: rare classes with few instances receive a disproportionately small share of the training signal. We show that extending instance‑sensitive losses to multi‑class segmentation via a one‑vs‑rest class decomposition repurposes them to also address class imbalance, as uniform averaging over classes ensures each class contributes equally regardless of frequency. We further show that inverse‑size weighting, which destabilizes training when applied globally due to weight imbalances across rare and common classes, becomes effective when integrated within the per‑component loss, confining the reweighting to each component's spatial context. On the BraTS‑METS 2025 dataset (260 test cases), multi‑class CC loss improves foreground Dice (0.64 +/‑ 0.26 vs. 0.59 +/‑ 0.27 baseline) and rare‑class Dice, while maintaining Panoptic Quality at DSC threshold 0.5. Multi‑class blob loss achieves the best Panoptic Quality at threshold 0.5 (0.40 +/‑ 0.24 vs. 0.38 +/‑ 0.25 baseline) and recognition quality (0.53 +/‑ 0.29 vs. 0.49 +/‑ 0.30). Integrating inverse‑size weighting within the per‑component loss increases rare‑class Dice to 0.44 +/‑ 0.36 at the cost of reduced detection quality.
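
The one-vs-rest class decomposition with uniform averaging described above can be sketched as follows. As a loudly stated simplification, a plain soft Dice loss stands in for the per-component blob or CC loss (which would additionally split each class into connected components); `binary_soft_dice_loss` and `one_vs_rest_loss` are hypothetical names, and treating class 0 as background is an assumption.

```python
import numpy as np

def binary_soft_dice_loss(prob: np.ndarray, target: np.ndarray,
                          eps: float = 1e-6) -> float:
    """Soft Dice loss for one binary foreground map (stand-in for blob/CC loss)."""
    inter = (prob * target).sum()
    denom = prob.sum() + target.sum()
    return float(1.0 - (2.0 * inter + eps) / (denom + eps))

def one_vs_rest_loss(probs: np.ndarray, labels: np.ndarray,
                     binary_loss=binary_soft_dice_loss) -> float:
    """Apply a single-class loss to each class versus the rest, then average.

    probs: (C, H, W) class probabilities; labels: (H, W) integer class ids.
    Class 0 is treated as background and skipped (assumption).
    """
    per_class = []
    for c in range(1, probs.shape[0]):
        target_c = (labels == c).astype(probs.dtype)  # class c versus everything else
        per_class.append(binary_loss(probs[c], target_c))
    # Uniform mean: every class contributes equally regardless of its frequency
    return float(np.mean(per_class))
```

The uniform mean over classes is the step that turns an instance-balancing loss into one that also counteracts class imbalance, since a rare class carries the same weight as a common one.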

Abstract:
Containerised shipping underpins global trade, yet container loss at sea remains a persistent safety, environmental, and economic challenge. Despite compliance with Cargo Securing Manuals, dynamic maritime conditions such as vessel motion, wind loading, and severe sea states can progressively destabilise container stacks, leading to overboard losses. With the new International Maritime Organisation's (IMO) mandatory reporting requirements for lost containers, there is an urgent need for a reliable, evidence‑based early detection solution for destabilised containers. This study showcases a low‑cost, retrofittable computer vision‑based system for early detection of destabilised containers using existing onboard cameras. The framework integrates object segmentation to isolate container stacks, temporal object tracking using optical flow and individual objects' residual motion extraction to quantify relative movement. Experimental evaluation on real onboard ship footage demonstrates that the proposed pipeline effectively isolates container‑level motion under challenging conditions of varying sea states and visibility conditions. By enabling early alerts for crew intervention and navigational adjustment, the proposed approach enhances cargo safety, operational resilience, and regulatory compliance.
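
The residual motion extraction step described above can be sketched as subtracting a shared motion estimate from each object's own motion. This is a minimal illustration only: the abstract does not say how global motion is estimated, so using the median flow over all tracked objects is an assumption, and `residual_motion` is a hypothetical name. The dense flow field and the per-object masks are taken as given inputs.

```python
import numpy as np

def residual_motion(flow: np.ndarray, object_masks: list) -> list:
    """Per-object motion magnitude after removing the shared (global) motion.

    flow: (H, W, 2) dense optical flow between two frames.
    object_masks: non-empty list of non-empty (H, W) boolean masks, one per object.
    """
    union = np.zeros(flow.shape[:2], dtype=bool)
    for mask in object_masks:
        union |= mask
    # Assumption: vessel and camera motion is approximated by the median flow
    # over all tracked objects, so a stable majority defines the reference.
    global_motion = np.median(flow[union], axis=0)
    residuals = []
    for mask in object_masks:
        object_motion = np.median(flow[mask], axis=0)
        residuals.append(float(np.linalg.norm(object_motion - global_motion)))
    return residuals
```

Under this assumption, a container that moves with the rest of the stack yields a residual near zero, while one that shifts relative to its neighbours stands out and could trigger an alert once the residual exceeds a chosen threshold.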

Abstract:
Audio‑based video object segmentation aims to locate and segment objects in videos conditioned on audio cues, requiring precise understanding of both appearance and motion. Recent audio‑driven video segmentation methods extend MLLMs by fusing audio and visual features for end‑to‑end localization. Despite their promise, these approaches are computationally intensive, struggle with aligning temporal audio cues to dynamic video content, and depend on large paired audio‑video datasets. To address these challenges, we present ASR‑SaSaSa2VA, a resource‑efficient framework for audio‑guided video segmentation. The key idea is to convert audio inputs into textual motion descriptions via automatic speech recognition (ASR) models and then leverage pre‑trained text‑based referring video segmentation models (e.g., SaSaSa2VA) for pixel‑level predictions. To further enhance robustness, we incorporate a no‑target expression detection module, implemented by a fine‑tuned audio‑based MLLM, which filters out audio clips that do not refer to any target object. This design allows the system to exploit strong pre‑trained models while effectively handling ambiguous or irrelevant audio inputs. Our approach achieves a final score of 80.7 in the 5th PVUW Challenge (MeViS‑v2‑Audio track), earning the second‑place ranking.

Abstract:
Indoor environments lack the spatial intelligence infrastructure that GPS provides outdoors; first responders arriving at unfamiliar buildings typically have no machine‑readable map of safety equipment. Prior work on 3D semantic segmentation for public safety identified two barriers: scarcity of labeled indoor training data and poor recognition of small safety‑critical features by native point‑cloud methods. This paper presents INSIGHT, a zero‑target‑domain‑annotation pipeline that projects 2D image understanding into 3D metric space via registered RGB‑D data. Two interchangeable vision stacks share a common 3D back end: a SAM3 foundation‑model stack for text‑prompted segmentation, and a traditional CV stack (open‑set detection, VQA, OCR) whose intermediate outputs are independently inspectable. Evaluated on all seven subareas of Stanford 2D‑3D‑S (70,496 images), the pipeline produces Pointcept‑schema‑compatible labeled point clouds and ISO 19164‑compliant scene graphs with ~10^4× compression; role‑filtered payloads transmit in <15 s at 1 Mbps over FirstNet Band 14. We report per‑point labeling accuracy on 7 shared classes, detection sensitivity for 15 safety‑critical classes absent from public 3D benchmarks alongside code‑capped deployable estimates, and inter‑pipeline complementarity, demonstrating that 2D‑to‑3D semantic transfer addresses the labeled‑data bottleneck while scene graphs provide building intelligence compact enough for field deployment.
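
Projecting 2D labels into 3D metric space with registered RGB‑D data typically relies on pinhole back-projection. The sketch below illustrates that general idea only: the intrinsics, the depth lookup structure, and the label string are hypothetical, and the result is in the camera frame (registration into a building frame is omitted).

```python
def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth to a 3D point in the camera frame."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)

def lift_mask(mask_pixels, depth, intrinsics, label):
    """mask_pixels: iterable of (u, v) inside a 2D segmentation mask.
    depth: dict mapping (u, v) to depth in metres (0 means no reading).
    Returns a list of ((x, y, z), label) pairs, skipping invalid depth."""
    fx, fy, cx, cy = intrinsics
    return [
        (backproject(u, v, depth[(u, v)], fx, fy, cx, cy), label)
        for u, v in mask_pixels
        if depth.get((u, v), 0.0) > 0.0
    ]

# Hypothetical intrinsics (fx, fy, cx, cy) and a three-pixel mask.
intrinsics = (500.0, 500.0, 320.0, 240.0)
depth = {(320, 240): 2.0, (420, 240): 2.0, (10, 10): 0.0}
points = lift_mask([(320, 240), (420, 240), (10, 10)], depth, intrinsics,
                   "fire extinguisher")
```

The pixel with no depth reading is dropped, so only two labeled 3D points are produced.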

Abstract:
WeatherSeg, an advanced semi‑supervised segmentation framework, addresses autonomous driving's environmental perception challenges in adverse weather while reducing annotation costs. This framework integrates a Dual Teacher‑Student Weight‑Sharing Model (DTSWSM) that enables knowledge distillation from weather‑affected images, and a Classifier Weight Updating Attention Mechanism (CWUAM) that dynamically adjusts classifier weights based on environmental attributes. Comprehensive evaluations demonstrate that WeatherSeg significantly outperforms baseline models in both accuracy and robustness across various weather conditions, including clear, rainy, cloudy, and foggy scenarios, establishing it as an effective solution for all‑weather semantic segmentation in autonomous driving and related applications.

Abstract:
Multimodal large language models (MLLMs) have advanced static visual‑spatial reasoning, yet they often fail to preserve long‑horizon spatial coherence in embodied settings where beliefs must be continuously revised from egocentric observations under environmental change. We introduce SpaMEM (Spatial Memory from Action Sequences), a large‑scale diagnostic benchmark that isolates the mechanics of spatial belief evolution via action‑conditioned scene transformations (spawn, place, remove) over long interaction horizons. SpaMEM is built on a physically grounded dataset with 10,601,392 high‑fidelity images across four modalities (RGB, depth, instance, semantic segmentation), collected from 25,000+ interaction sequences in 1,000 procedurally generated houses. We formalize embodied spatial reasoning as a three‑level hierarchy with 15 diagnostic tasks: Level 1 measures atomic spatial perception from single observations; Level 2 probes temporal reasoning with oracle textual state histories to factor out perceptual noise; and Level 3 requires end‑to‑end belief maintenance from raw visual streams under the same task dimensions. We further evaluate both short‑term (step‑wise) updates and long‑term (episodic) reconstruction. Benchmarking representative open‑source VLM families reveals a consistent stacked bottleneck: coordinate‑consistent grounding remains a hard ceiling, and the sharp collapse from Level 2 to Level 3 exposes a pronounced symbolic scaffolding dependency, where models succeed with text‑based bookkeeping but struggle to sustain robust visual memory. SpaMEM provides a granular diagnostic standard and motivates explicit mechanisms for state representation, belief revision, and long‑horizon episodic integration.

Abstract:
Human visual perception offers valuable insights for understanding computational principles of motion‑based scene interpretation. Humans robustly detect and segment moving entities that constitute independently moveable chunks of matter, whether observing sparse moving dots, textured surfaces, or naturalistic scenes. In contrast, existing computer vision systems lack a unified approach that works across these diverse settings. Inspired by principles of human perception, we propose a generative model that hierarchically groups low‑level motion cues and high‑level appearance features into particles (small Gaussians representing local matter), and groups particles into clusters capturing coherently and independently moveable physical entities. We develop a hardware‑accelerated inference algorithm based on parallelized block Gibbs sampling to recover stable particle motion and groupings. Our model operates on different kinds of inputs (random dots, stylized textures, or naturalistic RGB video), enabling it to work across settings where biological vision succeeds but existing computer vision approaches do not. We validate this unified framework across three domains: on 2D random dot kinematograms, our approach captures human object perception including graded uncertainty across ambiguous conditions; on a Gestalt‑inspired dataset of camouflaged rotating objects, our approach recovers correct 3D structure from motion and thereby accurate 2D object segmentation; and on naturalistic RGB videos, our model tracks the moving 3D matter that makes up deforming objects, enabling robust object‑level scene understanding. This work thus establishes a general framework for motion‑based perception grounded in principles of human vision.

Abstract:
Accurate quantification of the physical exposure area of beach litter, rather than simple item counts, is essential for credible ecological risk assessment of marine debris. However, automated UAV‑based monitoring predominantly relies on bounding‑box detection, which systematically overestimates the planar area of irregular litter objects. To address this geometric limitation, we develop PLAS‑Net (Pixel‑level Litter Area Segmentor), an instance segmentation framework that extracts pixel‑accurate physical footprints of coastal debris. Evaluated on UAV imagery from a monsoon‑driven pocket beach in Koh Tao, Thailand, PLAS‑Net achieves an mAP_50 of 58.7% with higher precision than eleven baseline models, demonstrating improved mask fidelity under complex coastal conditions. To illustrate how mask accuracy affects the conclusions of environmental analysis, we conduct three downstream demonstrations: (i) power‑law fitting of normalized plastic density (NPD) to characterize fragmentation dynamics; (ii) area‑weighted ecological risk index (ERI) to map spatial pollution hotspots; and (iii) source composition analysis revealing the abundance‑area paradox: fishing gear constitutes a small proportion of the total number of items, but has the largest physical area per unit item. Pixel‑level area extraction can provide more valuable information for coastal monitoring than methods based solely on counting.
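
The geometric point about bounding boxes can be made concrete with a toy example. The sketch below compares the area of a binary instance mask with the area of its tight bounding box; the ground sampling distance and the L-shaped mask are made-up values for illustration, not data from the paper.

```python
def footprint_areas(mask, gsd_m):
    """mask: list of rows of 0/1 values for one litter instance.
    gsd_m: assumed ground sampling distance in metres per pixel.
    Returns (mask_area_m2, bbox_area_m2)."""
    coords = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not coords:
        return 0.0, 0.0
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    bbox_pixels = (max(rows) - min(rows) + 1) * (max(cols) - min(cols) + 1)
    pixel_area = gsd_m ** 2
    return len(coords) * pixel_area, bbox_pixels * pixel_area

# An L-shaped object: 5 mask pixels inside a 3x3 bounding box.
l_shaped = [
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 1],
]
mask_area, bbox_area = footprint_areas(l_shaped, gsd_m=0.01)
overestimate = bbox_area / mask_area  # ratio of box area to true footprint
```

For this shape the box covers 9 pixels against 5 in the mask, so a box-based estimate would overstate the footprint by a factor of 1.8.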

Abstract:
Current deep learning models achieve high performance by learning statistical correlations from vast datasets, which stands in stark contrast to human learning. They lack the flexibility of humans, particularly preverbal infants, to autonomously acquire the underlying structure of the world from limited experience and adapt to novel situations. In this study, we propose an unsupervised representation learning method based on a hierarchical relationship in group operations, rather than statistical independence, aiming to build a computational model of the cognitive development of infants. The proposed model features an integrated architecture that simultaneously performs object segmentation and the extraction of motion laws from dynamic image sequences. By introducing the algebraic notion of a homomorphism as a structural constraint within a neural network, the model structurally separates pixel‑level changes into meaningful, decomposed transformation components, such as translation and deformation. Using interaction scenes (chasing and evading tasks) based on developmental science findings, we experimentally demonstrate that the model can segment multiple objects into individual slots without any ground‑truth labels. Furthermore, we confirm that relative movements between objects, such as approaching or receding, are accurately mapped and structured into a one‑dimensional additive latent space. These results suggest that by introducing algebraic geometric constraints rather than relying solely on statistical correlation learning, physically interpretable "disentangled representations" can be acquired. This study contributes to the understanding of the process by which infants internalize environmental laws as structures and provides a new perspective for constructing artificial systems with developmental intelligence.

Abstract:
Amodal segmentation is a challenging task that aims to predict the complete geometric shape of objects, including their occluded regions. Although existing methods primarily focus on amodal segmentation within the training domain, these approaches often lack the generalization capacity to extend effectively to novel object categories and unseen contexts. This paper introduces Amodal SAM, a unified framework that leverages SAM (Segment Anything Model) for both amodal image and amodal video segmentation. Amodal SAM preserves the powerful generalization ability of SAM while extending its inherent capabilities to the amodal segmentation task. The improvements lie in three aspects: (1) a lightweight Spatial Completion Adapter that enables occluded region reconstruction, (2) a Target‑Aware Occlusion Synthesis (TAOS) pipeline that addresses the scarcity of amodal annotations by generating diverse synthetic training data, and (3) novel learning objectives that enforce regional consistency and topological regularization. Extensive experiments demonstrate that Amodal SAM achieves state‑of‑the‑art performance on standard benchmarks, while simultaneously exhibiting robust generalization to novel scenarios. We anticipate that this research will advance the field toward practical amodal segmentation systems capable of operating effectively in unconstrained real‑world environments.

Abstract:
Traditional change detection identifies where changes occur, but does not explain what changed in natural language. Existing remote sensing change captioning datasets typically describe overall image‑level differences, leaving fine‑grained localized semantic reasoning largely unexplored. To close this gap, we present RSRCC, a new benchmark for remote sensing change question‑answering containing 126k questions, split into 87k training, 17.1k validation, and 22k test instances. Unlike prior datasets, RSRCC is built around localized, change‑specific questions that require reasoning about a particular semantic change. To the best of our knowledge, this is the first remote sensing change question‑answering benchmark designed explicitly for such fine‑grained reasoning‑based supervision. To construct RSRCC, we introduce a hierarchical semi‑supervised curation pipeline that uses Best‑of‑N ranking as a critical final ambiguity‑resolution stage. First, candidate change regions are extracted from semantic segmentation masks, then initially screened using an image‑text embedding model, and finally validated through retrieval‑augmented vision‑language curation with Best‑of‑N ranking. This process enables scalable filtering of noisy and ambiguous candidates while preserving semantically meaningful changes. The dataset is available at https://huggingface.co/datasets/google/RSRCC.

Abstract:
Vision Mamba, as a state space model (SSM), employs a zero‑order hold (ZOH) discretization, which assumes that input signals remain constant between sampling instants. This assumption degrades temporal fidelity in dynamic visual environments and constrains the attainable accuracy of modern SSM‑based vision models. In this paper, we present a systematic and controlled comparison of six discretization schemes instantiated within the Vision Mamba framework: ZOH, first‑order hold (FOH), bilinear/Tustin transform (BIL), polynomial interpolation (POL), higher‑order hold (HOH), and the fourth‑order Runge‑Kutta method (RK4). We evaluate each method on standard visual benchmarks to quantify its influence in image classification, semantic segmentation, and object detection. Our results demonstrate that POL and HOH yield the largest gains in accuracy at the cost of higher training‑time computation. In contrast, BIL provides consistent improvements over ZOH with modest additional overhead, offering the most favorable trade‑off between precision and efficiency. These findings elucidate the pivotal role of discretization in SSM‑based vision architectures and furnish empirically grounded justification for adopting BIL as the default discretization baseline for state‑of‑the‑art SSM models.
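
To make the ZOH versus BIL contrast concrete, the sketch below applies the textbook formulas to a scalar continuous-time system x'(t) = a·x(t) + b·u(t) with step size dt. This is an illustration under that simplifying assumption; the parameterization actually used inside Vision Mamba may differ, and the numeric values are arbitrary.

```python
import math

def zoh(a, b, dt):
    """Zero-order hold: exact when the input is constant over each step.
    Requires a != 0 in this scalar form."""
    a_bar = math.exp(dt * a)
    b_bar = (a_bar - 1.0) / a * b
    return a_bar, b_bar

def bilinear(a, b, dt):
    """Bilinear (Tustin) transform: trapezoidal-rule approximation."""
    denom = 1.0 - 0.5 * dt * a
    a_bar = (1.0 + 0.5 * dt * a) / denom
    b_bar = dt * b / denom
    return a_bar, b_bar

# Arbitrary stable example: a = -1, b = 1, step 0.1.
a_zoh, b_zoh = zoh(-1.0, 1.0, 0.1)
a_bil, b_bil = bilinear(-1.0, 1.0, 0.1)
```

For small steps the two discretizations nearly coincide; they diverge as dt·|a| grows, which is where the choice of scheme starts to matter.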

Abstract:
Open‑vocabulary 3D instance segmentation is a core capability for robotics and AR/VR, but prior methods trade one bottleneck for another: multi‑stage 2D+3D pipelines aggregate foundation‑model outputs at hundreds of seconds per scene, while pseudo‑labeled end‑to‑end approaches rely on fragmented masks and external region proposals. We present SpaCeFormer, a proposal‑free space‑curve transformer that runs at 0.14 seconds per scene, 2‑3 orders of magnitude faster than multi‑stage 2D+3D pipelines. We pair it with SpaCeFormer‑3M, the largest open‑vocabulary 3D instance segmentation dataset (3.0M multi‑view‑consistent captions over 604K instances from 7.4K scenes) built through multi‑view mask clustering and multi‑view VLM captioning; it reaches 21x higher mask recall than prior single‑view pipelines (54.3% vs 2.5% at IoU > 0.5). SpaCeFormer combines spatial window attention with Morton‑curve serialization for spatially coherent features, and uses a RoPE‑enhanced decoder to predict instance masks directly from learned queries without external proposals. On ScanNet200 we achieve 11.1 zero‑shot mAP, a 2.8x improvement over the prior best proposal‑free method; on ScanNet++ and Replica, we reach 22.9 and 24.1 mAP, surpassing all prior methods including those using multi‑view 2D inputs.
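
Morton-curve (Z-order) serialization orders 3D points by interleaving the bits of their integer coordinates, so that points close in the sequence tend to be close in space. The sketch below shows only this generic construction; the bit width and the voxel coordinates are illustrative assumptions, and how SpaCeFormer quantizes and uses the order is not specified here.

```python
def morton3d(x, y, z, bits=10):
    """Interleave the low `bits` bits of non-negative integers x, y, z.
    Bit i of x lands at position 3*i, of y at 3*i + 1, of z at 3*i + 2."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code

def serialize(voxels):
    """Sort integer voxel coordinates along the Morton curve."""
    return sorted(voxels, key=lambda v: morton3d(*v))

# Hypothetical voxel coordinates.
voxels = [(1, 1, 1), (0, 0, 0), (2, 0, 0), (0, 1, 0)]
ordered = serialize(voxels)
```

Windowed attention can then operate on contiguous chunks of the serialized sequence.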

Abstract:
Micro‑scale street‑level economic assessment is fundamental for precision spatial resource allocation. While Street View Imagery (SVI) advances urban sensing, existing approaches remain semantically superficial and overlook brand hierarchy heterogeneity and structural recession. To address this, we propose a visual‑semantic and field‑based spatiotemporal framework, operationalized via the Street Economic Vitality Index (SEVI). Our approach integrates physical and semantic streetscape parsing through instance segmentation of signboards, glass interfaces, and storefront closures. A dual‑stage VLM‑LLM pipeline standardizes signage into global hierarchies to quantify a spatially smoothed brand premium index. To overcome static SVI limitations, we introduce a temporal lag design using Location‑Based Services (LBS) data to capture realized demand. Combined with a category‑weighted Gaussian spillover model, we construct a three‑dimensional diagnostic system covering Commercial Activity, Spatial Utilization, and Physical Environment. Experiments based on time‑lagged geographically weighted regression across eight tidal periods in Nanjing reveal quasi‑causal spatiotemporal heterogeneity. Street vibrancy arises from interactions between hierarchical brand clustering and mall‑induced externalities. High‑quality interfaces show peak attraction during midday and evening, while structural recession produces a lagged nighttime repulsion effect. The framework offers evidence‑based support for precision spatial governance.

Abstract:
SAM3 advances open‑vocabulary semantic segmentation by introducing a prompt‑driven mask generation paradigm. However, in multi‑class open‑vocabulary scenarios, masks generated independently from different category prompts lack a unified and inter‑class comparable evidence scale, often resulting in overlapping coverage and unstable competition. Moreover, synonymous expressions of the same concept tend to activate inconsistent semantic and spatial evidence, leading to intra‑class drift that exacerbates inter‑class conflicts and compromises overall inference stability. To address these issues, we propose CoCo‑SAM3 (Concept‑Conflict SAM3), which explicitly decouples inference into intra‑class enhancement and inter‑class competition. Our method first aligns and aggregates evidence from synonymous prompts to strengthen concept consistency. It then performs inter‑class competition on a unified comparable scale, enabling direct pixel‑wise comparisons among all candidate classes. This mechanism stabilizes multi‑class inference and effectively mitigates inter‑class conflicts. Without requiring any additional training, CoCo‑SAM3 achieves consistent improvements across eight open‑vocabulary semantic segmentation benchmarks.
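
The two-stage decoupling can be sketched as follows: first combine the per-pixel scores of synonymous prompts within a class, then let classes compete pixel by pixel. Averaging as the aggregation rule and a plain argmax as the competition rule are assumptions made for illustration; the abstract does not specify CoCo‑SAM3's actual alignment or scaling.

```python
def aggregate_synonyms(prompt_scores):
    """prompt_scores: list of per-pixel score lists, one list per synonym.
    Assumption: intra-class evidence is combined by a simple mean."""
    count = len(prompt_scores)
    return [sum(pixel) / count for pixel in zip(*prompt_scores)]

def compete(class_scores):
    """class_scores: dict mapping class name to per-pixel scores that are
    assumed to already lie on a comparable scale.
    Returns the winning class name for each pixel."""
    names = list(class_scores)
    labels = []
    for pixel in zip(*(class_scores[name] for name in names)):
        best = max(range(len(names)), key=lambda i: pixel[i])
        labels.append(names[best])
    return labels

# Two pixels, two classes; "road" has two hypothetical synonym prompts.
scores = {
    "road": aggregate_synonyms([[0.9, 0.2], [0.7, 0.4]]),
    "building": aggregate_synonyms([[0.1, 0.6]]),
}
labels = compete(scores)
```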

Abstract:
Transformers have become a common foundation across deep learning, yet 3D scene understanding still relies on specialized backbones with strong domain priors. This keeps the field isolated from the broader Transformer ecosystem, limiting the transfer of new advances as well as the benefits of increasingly optimized software and hardware stacks. To bridge this gap, we adapt the vanilla Transformer encoder to 3D scenes with minimal modifications. Given an input 3D scene, we partition it into volumetric patch tokens, process them with full global self‑attention, and inject positional information via a 3D extension of rotary positional embeddings. We call the resulting model the Volume Transformer (Volt) and apply it to 3D semantic segmentation. Naively training Volt on standard 3D benchmarks leads to shortcut learning, highlighting the limited scale of current 3D supervision. To overcome this, we introduce a data‑efficient training recipe based on strong 3D augmentations, regularization, and distillation from a convolutional teacher, making Volt competitive with state‑of‑the‑art methods. We then scale supervision through joint training on multiple datasets and show that Volt benefits more from increased scale than domain‑specific 3D backbones, achieving state‑of‑the‑art results across indoor and outdoor datasets. Finally, when used as a drop‑in backbone in a standard 3D instance segmentation pipeline, Volt again sets a new state of the art, highlighting its potential as a simple, scalable, general‑purpose backbone for 3D scene understanding.
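
One common way to extend rotary positional embeddings to 3D is to split the channel dimension into three groups and rotate each group by the position along one axis. The sketch below follows that axis-wise scheme as an assumption; Volt's exact formulation is not given in the abstract. The example checks the defining property that attention scores depend only on relative position.

```python
import math

def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive channel pairs of an even-length vector by pos * theta_i."""
    half = len(vec) // 2
    out = []
    for i in range(half):
        theta = pos * base ** (-i / half)
        a, b = vec[2 * i], vec[2 * i + 1]
        out.append(a * math.cos(theta) - b * math.sin(theta))
        out.append(a * math.sin(theta) + b * math.cos(theta))
    return out

def rope_3d(vec, xyz):
    """Apply 1D RoPE to three equal channel groups, one per spatial axis."""
    assert len(vec) % 6 == 0, "channels must split into three even-sized groups"
    group = len(vec) // 3
    out = []
    for axis, pos in enumerate(xyz):
        out.extend(rope_1d(vec[axis * group:(axis + 1) * group], pos))
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
k = [6.0, 5.0, 4.0, 3.0, 2.0, 1.0]
# Same relative offset (1, 2, 3) at two different absolute locations.
score_a = dot(rope_3d(q, (1, 2, 3)), rope_3d(k, (0, 0, 0)))
score_b = dot(rope_3d(q, (6, 7, 8)), rope_3d(k, (5, 5, 5)))
```

Both scores agree up to floating-point error, because each rotated pair contributes a term that depends only on the position difference along its axis.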

Abstract:
This paper presents the first study on Unsupervised Domain Adaptation (UDA) for multimodal 3D panoptic segmentation (mm‑3DPS), aiming to improve generalization under domain shifts commonly encountered in real‑world autonomous driving. A straightforward solution is to employ a pseudo‑labeling strategy, which is widely used in UDA to generate supervision for unlabeled target data, combined with an mm‑3DPS backbone. However, existing supervised mm‑3DPS methods rely heavily on strong cross‑modal complementarity between LiDAR and RGB inputs, making them fragile under domain shifts where one modality degrades (e.g., poor lighting or adverse weather). Moreover, conventional pseudo‑labeling typically retains only high‑confidence regions, leading to fragmented masks and incomplete object supervision, which are issues particularly detrimental to panoptic segmentation. To address these challenges, we propose PanDA, the first UDA framework specifically designed for multimodal 3D panoptic segmentation. To improve robustness against single‑sensor degradation, we introduce an asymmetric multimodal augmentation that selectively drops regions to simulate domain shifts and improve robust representation learning. To enhance pseudo‑label completeness and reliability, we further develop a dual‑expert pseudo‑label refinement module that extracts domain‑invariant priors from both 2D and 3D modalities. Extensive experiments across diverse domain shifts, spanning time, weather, location, and sensor variations, show that PanDA significantly surpasses state‑of‑the‑art UDA baselines for 3D semantic segmentation.

Abstract:
We present an approach for enhancing non‑playable characters (NPCs) in games by combining large language models (LLMs) with computer vision to provide contextual awareness of their surroundings. Conventional NPCs typically rely on pre‑scripted dialogue and lack spatial understanding, which limits their responsiveness to player actions and reduces overall immersion. Our method addresses these limitations by capturing panoramic images of an NPC's environment and applying semantic segmentation to identify objects and their spatial positions. The extracted information is used to generate a structured JSON representation of the environment, combining object locations derived from segmentation with additional scene graph data within the NPC's bounding sphere, encoded as directional vectors. This representation is provided as input to the LLM, enabling NPCs to incorporate spatial knowledge into player interactions. As a result, NPCs can dynamically reference nearby objects, landmarks, and environmental features, leading to more believable and engaging gameplay. We describe the technical implementation of the system and evaluate it in two stages. First, an expert interview was conducted to gather feedback and identify areas for improvement. After integrating these refinements, a user study was performed, showing that participants preferred the context‑aware NPCs over a non‑context‑aware baseline, confirming the effectiveness of the proposed approach.
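
A structured environment description of this kind might look like the output of the sketch below, which keeps only objects inside a bounding sphere and encodes each as a unit direction vector plus distance. The field names, the coordinate convention, and the object list are hypothetical; the paper's actual JSON schema is not given in the abstract.

```python
import json
import math

def describe_surroundings(npc_pos, objects, radius):
    """objects: dict mapping an object name to its (x, y, z) world position.
    Returns a JSON string listing objects within `radius` of the NPC."""
    entries = []
    for name, pos in objects.items():
        offset = [p - n for p, n in zip(pos, npc_pos)]
        dist = math.sqrt(sum(o * o for o in offset))
        if 0 < dist <= radius:
            entries.append({
                "object": name,
                "distance": round(dist, 2),
                "direction": [round(o / dist, 3) for o in offset],
            })
    return json.dumps({"nearby_objects": entries})

# Hypothetical scene: the tower lies outside the 10-unit bounding sphere.
scene = describe_surroundings(
    (0, 0, 0), {"well": (3, 0, 4), "tower": (100, 0, 0)}, radius=10
)
```

The resulting string could then be placed in the LLM prompt alongside the dialogue context.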

Abstract:
Frame‑wise semantic segmentation of indoor lidar scans is a fundamental step toward higher‑level 3D scene understanding and mapping applications. However, acquiring frame‑wise ground truth for training deep learning models is costly and time‑consuming. This challenge is largely addressed, for imagery, by Visual Foundation Models (VFMs) which segment image frames. The same VFMs may be used to train a lidar scan frame segmentation model via a 2D‑to‑3D distillation pipeline. The success of such distillation has been shown for autonomous driving scenes, but not yet for indoor scenes. Here, we study the feasibility of repeating this success for indoor scenes, in a frame‑wise distillation manner by coupling each lidar scan with a VFM‑processed camera image. The evaluation is done using indoor SLAM datasets, where pseudo‑labels are used for downstream evaluation. Also, a small manually annotated lidar dataset is provided for validation, as there are no other lidar frame‑wise indoor datasets with semantics. Results show that the distilled model achieves up to 56% mIoU under pseudo‑label evaluation and around 36% mIoU with real labels, demonstrating the feasibility of cross‑modal distillation for indoor lidar semantic segmentation without manual annotations.
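
Both reported numbers are mean intersection-over-union (mIoU) scores, differing only in whether the reference labels are pseudo-labels or manual annotations. As a reminder of what the metric measures, here is a minimal per-point sketch; skipping classes absent from both prediction and reference is one common convention and an assumption here, and the label lists are toy values.

```python
def mean_iou(pred, target, num_classes):
    """pred, target: equal-length sequences of integer class ids per point."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # ignore classes that appear in neither sequence
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

# Toy example: class 0 IoU = 1/2, class 1 IoU = 2/3.
pred = [0, 0, 1, 1]
target = [0, 1, 1, 1]
score = mean_iou(pred, target, num_classes=2)
```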

Abstract:
Visual state‑space models (SSMs) are increasingly promoted as efficient alternatives to Vision Transformers, yet their practical advantages remain unclear under fair comparison because existing studies rarely isolate encoder effects from decoder and training choices. We present a strictly controlled benchmark of representative visual SSM families, including VMamba, MambaVision, and Spatial‑Mamba, for remote‑sensing semantic segmentation, in which only the encoder varies across experiments. Evaluated on LoveDA and ISPRS Potsdam under a unified 4‑stage feature interface and a fixed lightweight decoder, the benchmark reveals three main findings: intra‑family scaling yields only modest gains, cross‑domain generalization is strongly asymmetric, and boundary delineation is the dominant failure mode under distribution shift. Although visual SSMs achieve favorable accuracy‑efficiency trade‑offs relative to the controlled CNN and Transformer baselines considered here, the results suggest that future improvements are more likely to come from robustness‑oriented design and boundary‑aware decoding than from encoder scaling alone. By isolating encoder behavior under a unified and reproducible protocol, this study establishes a practical reference benchmark for the design and evaluation of future Mamba‑based segmentation backbones.

Abstract:
This report presents an Audio‑aware Referring Video Object Segmentation (Ref‑VOS) pipeline tailored to the MEVIS_Audio setting, where the referring expression is provided in spoken form rather than as clean text. Compared with a standard Sa2VA‑based Ref‑VOS pipeline, the proposed system introduces two additional front‑end stages: speech transcription and visual existence verification. Specifically, we first employ VibeVoice‑ASR to convert long‑form spoken input into a structured textual transcript. Since audio‑derived queries are inherently noisy and may describe entities that are not visually present in the video, we then introduce an Omni‑based judgment module to determine whether the transcribed target can be grounded in the visual content. If the target is judged to be absent, the pipeline terminates early and outputs all‑zero masks. Otherwise, the transcript is transformed into a segmentation‑oriented prompt and fed into Sa2VA to obtain a coarse mask trajectory over the full video. Importantly, this trajectory is treated as an initial semantic hypothesis rather than a final prediction. On top of it, an agentic refinement layer evaluates query reliability, temporal relevance, anchor quality, and potential error sources, and may invoke SAM3 to improve spatial boundary precision and temporal consistency. The resulting framework explicitly decomposes the MEVIS_Audio task into audio‑to‑text conversion, visual existence verification, coarse video segmentation, and agent‑guided refinement. Such a staged design is substantially more appropriate for audio‑conditioned Ref‑VOS than directly sending noisy ASR outputs into a segmentation model.
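
The staged control flow, including the early exit with all-zero masks, can be sketched as below. The four callables are hypothetical stand-ins for the ASR model, the existence judge, the coarse segmenter, and the optional refiner; none of them reflects the real interfaces of VibeVoice‑ASR, Sa2VA, or SAM3.

```python
def run_pipeline(audio, frames, transcribe, target_exists, segment, refine=None):
    """frames: list of 2D frames (lists of rows). Returns one mask per frame.

    transcribe, target_exists, segment, refine are injected placeholders,
    so this function only demonstrates the order of the stages.
    """
    text = transcribe(audio)
    if not target_exists(text, frames):
        # Early termination: emit an all-zero mask for every frame.
        return [[[0] * len(row) for row in frame] for frame in frames]
    coarse_masks = segment(text, frames)
    return refine(text, frames, coarse_masks) if refine else coarse_masks

# Stub components on a single 2x2 frame.
frames = [[[5, 5], [5, 5]]]
absent = run_pipeline(
    b"", frames,
    transcribe=lambda audio: "a red car",
    target_exists=lambda text, fr: False,
    segment=lambda text, fr: None,
)
present = run_pipeline(
    b"", frames,
    transcribe=lambda audio: "a red car",
    target_exists=lambda text, fr: True,
    segment=lambda text, fr: [[[1, 0], [0, 0]]],
)
```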

Abstract:
In recent years, the Vision Transformer (ViT) has garnered significant attention within the computer vision community. However, the core component of ViT, Self‑Attention, lacks explicit spatial priors and suffers from quadratic computational complexity, limiting its applicability. To address these issues, we have proposed RMT, a robust vision backbone with explicit spatial priors for general purposes. RMT utilizes Manhattan distance decay to introduce spatial information and employs a horizontal and vertical decomposition attention method to model global information. Building on the strengths of RMT, Euclidean enhanced Vision Transformer (EVT) is an expanded version that incorporates several key improvements. Firstly, EVT uses a more reasonable Euclidean distance decay to enhance the modeling of spatial information, allowing for a more accurate representation of spatial relationships compared to the Manhattan distance used in RMT. Secondly, EVT abandons the decomposed attention mechanism featured in RMT and instead adopts a simpler spatially‑independent grouping approach, providing the model with greater flexibility in controlling the number of tokens within each group. With these modifications, EVT offers a more sophisticated and adaptable approach to incorporating spatial priors into the Self‑Attention mechanism, thus overcoming some of the limitations associated with RMT and further enhancing its applicability in various computer vision tasks. Extensive experiments on Image Classification, Object Detection, Instance Segmentation, and Semantic Segmentation demonstrate that EVT exhibits exceptional performance. Without additional training data, EVT achieves 86.6% top‑1 accuracy on ImageNet‑1k.
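
The difference between the two spatial priors can be illustrated with a decay mask whose entry for a pair of tokens is gamma raised to their grid distance. This is a minimal sketch assuming that exponential form and a single fixed gamma; how RMT and EVT combine the mask with attention scores is not shown.

```python
import math

def decay_mask(height, width, gamma, metric="euclidean"):
    """Return an (H*W) x (H*W) matrix of gamma ** distance between grid tokens.
    Tokens are enumerated row by row; 0 < gamma < 1 is assumed."""
    coords = [(r, c) for r in range(height) for c in range(width)]
    mask = []
    for r1, c1 in coords:
        row = []
        for r2, c2 in coords:
            if metric == "manhattan":
                dist = abs(r1 - r2) + abs(c1 - c2)
            else:
                dist = math.hypot(r1 - r2, c1 - c2)
            row.append(gamma ** dist)
        mask.append(row)
    return mask

manhattan = decay_mask(2, 2, 0.5, "manhattan")
euclidean = decay_mask(2, 2, 0.5, "euclidean")
# Token 0 is at (0, 0) and token 3 at (1, 1): a diagonal neighbour.
diag_manhattan = manhattan[0][3]  # distance 2
diag_euclidean = euclidean[0][3]  # distance sqrt(2)
```

Under the Manhattan metric a diagonal neighbour is penalized as much as a token two steps away along an axis, whereas the Euclidean metric decays it less, which matches the geometric intuition the abstract appeals to.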

Abstract:
Synthetic Aperture Radar (SAR) plays a critical role in maritime surveillance, yet deep learning for SAR analysis is limited by the lack of pixel‑level annotations. This paper explores how general‑purpose vision foundation models can enable zero‑shot ship instance segmentation in SAR imagery, eliminating the need for pixel‑level supervision. A YOLOv11‑based detector trained on open SAR datasets localizes ships via bounding boxes, which then prompt the Segment Anything Model 2 (SAM2) to produce instance masks without any mask annotations. Unlike prior SAM‑based SAR approaches that rely on fine tuning or adapters, our method demonstrates that spatial constraints from a SAR‑trained detector alone can effectively regularize foundation model predictions. This design partially mitigates the optical‑SAR domain gap and enables downstream applications such as vessel classification, size estimation, and wake analysis. Experiments on the SSDD benchmark achieve a mean IoU of 0.637 (89% of a fully supervised baseline) with an overall ship detection rate of 89.2%, confirming a scalable, annotation‑efficient pathway toward foundation‑model‑driven SAR image understanding.

Abstract:
Plankton monitoring is essential for assessing aquatic ecosystems but is limited by the labor‑intensive nature of manual microscopic analysis. Automating the segmentation of plankton from crowded images is crucial; however, it faces two major challenges: (i) the scarcity of pixel‑level annotated datasets and (ii) the difficulty of distinguishing plankton from debris and overlapping individuals using conventional CNN‑based methods. To address these issues, we propose PlankFormer, a novel framework for plankton instance segmentation. First, to overcome the data shortage, we introduce a method to generate labeled Pseudo Community Images (PCI) by synthesizing individual plankton images onto diverse backgrounds, including those created by generative models. Second, we propose a segmentation model utilizing a Vision Transformer (ViT) backbone with a Mask2Former decoder. To robustly capture the global structural features of plankton against occlusion and debris, we employ a Masked Autoencoder (MAE) for self‑supervised pre‑training on unlabeled individual images. Experimental results on real‑world datasets demonstrate that our method significantly outperforms conventional methods, such as Mask R‑CNN, particularly in challenging environments with high debris density. We demonstrate that our synthetic training strategy and MAE‑based architecture enable high‑precision segmentation while requiring fewer manual annotations for individual plankton images.

Abstract:
Unrecovered e‑waste represents a significant economic loss. Hard disk drives (HDDs) comprise a valuable e‑waste stream necessitating robotic disassembly. Automating the disassembly of HDDs requires holistic 3D sensing, scene understanding, and fastener localization; however, current methods are fragmented, lack robust 3D sensing, and lack fastener localization. We propose an autonomous vision pipeline which performs 3D sensing using a Fringe Projection Profilometry (FPP) module, with selective triggering of a depth completion module where FPP fails, and integrates this module with a lightweight, real‑time instance segmentation network for scene understanding and critical component localization. By utilizing the same FPP camera‑projector system for both our depth sensing and component localization modules, our depth maps and derived 3D geometry are inherently pixel‑wise aligned with the segmentation masks without registration, providing an advantage over RGB‑D perception systems common in industrial sensing. We optimize both our trained depth completion and instance segmentation networks for deployment‑oriented inference. The proposed system achieves a box mAP@50 of 0.960 and mask mAP@50 of 0.957 for instance segmentation, while the selected depth completion configuration with the Depth Anything V2 Base backbone achieves an RMSE of 2.317 mm and MAE of 1.836 mm; the Platter Facing learned inference stack achieves a combined latency of 12.86 ms and a throughput of 77.7 Frames Per Second (FPS) on the evaluation workstation. Finally, we adopt a sim‑to‑real transfer learning approach to augment our physical dataset. The proposed perception pipeline provides both high‑fidelity semantic and spatial data which can be valuable for downstream robotic disassembly. The synthetic dataset developed for HDD instance segmentation will be made publicly available.

Abstract:
Gaussian Splatting has recently become one of the most popular frameworks for photorealistic 3D scene reconstruction and rendering. While current rasterizers allow for efficient mappings of 3D Gaussian splats onto 2D camera views, this work focuses on mapping 2D image information (e.g. color, neural features or segmentation masks) efficiently back onto an existing scene of Gaussian splats. This 'opposite' direction enables applications ranging from scene relighting and stylization to 3D semantic segmentation, but also introduces challenges, such as view‑dependent colorization and occlusion handling. Our approach tackles these challenges using the normal equation to solve a visibility‑weighted least squares problem for every Gaussian and can be implemented efficiently with existing differentiable rasterizers. We demonstrate the effectiveness of our approach on scene relighting, feature enrichment and 3D semantic segmentation tasks, achieving up to an order of magnitude speedup compared to gradient descent‑based baselines.
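
The per‑Gaussian fit mentioned above can be written out as a generic weighted least‑squares problem. The notation below (y_v, w_v, B_v, x_g) is assumed for illustration and does not come from the abstract; the paper's exact weights and parameterization may differ.

```latex
% Assumed notation (not from the abstract): for one Gaussian g,
%   y_v      : 2D value (color, feature, mask score) observed for g in view/pixel v
%   w_v >= 0 : visibility weight of g in v
%   B_v      : basis matrix mapping the parameters x_g to the value seen in v
\[
  x_g^\star = \arg\min_{x} \sum_{v} w_v \,\lVert y_v - B_v x \rVert_2^2
  \quad\Longrightarrow\quad
  \Bigl(\sum_{v} w_v B_v^\top B_v\Bigr)\, x_g^\star = \sum_{v} w_v B_v^\top y_v .
\]
% View-independent special case, B_v = I: the solution is a visibility-weighted mean.
\[
  x_g^\star = \frac{\sum_{v} w_v\, y_v}{\sum_{v} w_v}.
\]
```

In this form each Gaussian only needs two accumulated sums over the views, which is one plausible reason a closed‑form solve can be much faster than iterative gradient descent.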

Abstract:
Interactive video segmentation models such as SAM2 have demonstrated strong generalization across diverse visual domains. However, under weak user supervision, for example, when sparse point prompts are provided on a single frame, their predictions often suffer from temporal instability, including flickering boundaries, object dropout, and inconsistent object extents across frames. These issues limit their reliability in downstream video understanding and control applications. In this paper, we propose an inference‑time temporal probability smoothing method that improves the temporal stability of SAM2‑based video segmentation without retraining or architectural modification. Our approach operates directly on per‑frame segmentation probability maps and leverages optical‑flow‑based motion warping together with pixel‑wise uncertainty estimates derived from segmentation entropy and forward‑backward flow consistency. These signals are used to adaptively blend current‑frame predictions with motion‑aligned historical estimates, yielding temporally coherent segmentation outputs under weak prompts. We evaluate the proposed method on four diverse video sequences using a comprehensive set of frame‑wise and temporal stability metrics, including motion‑compensated IoU, boundary consistency, object persistence, and area volatility. Experimental results demonstrate consistent improvements in temporal stability over vanilla SAM2 inference while preserving spatial accuracy. The proposed framework is lightweight, model‑agnostic, and well‑suited for real‑time, interactive video segmentation.
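
A minimal sketch of the adaptive blending idea follows, assuming the history map has already been warped to the current frame by optical flow. The blending rule, the function names, and the constants `max_blend` and `fb_tol` are illustrative assumptions; the abstract does not specify them.

```python
import math


def binary_entropy(p: float) -> float:
    """Shannon entropy in bits of a foreground probability p (0 at p=0 or p=1)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def smooth_pixel(p_cur, p_hist_warped, fb_error, max_blend=0.8, fb_tol=1.0):
    """Blend the current probability with the motion-aligned historical one.

    Assumed rule (hypothetical): rely on history more when the current
    prediction is uncertain, and less when the forward-backward flow
    error (in pixels) is large.
    """
    uncertainty = binary_entropy(p_cur)              # in [0, 1]
    flow_conf = max(0.0, 1.0 - fb_error / fb_tol)    # in [0, 1]
    alpha = max_blend * uncertainty * flow_conf
    return (1.0 - alpha) * p_cur + alpha * p_hist_warped


def smooth_frame(cur, hist_warped, fb_err):
    """Apply smooth_pixel to every pixel of equally sized 2D lists."""
    return [
        [smooth_pixel(c, h, e) for c, h, e in zip(row_c, row_h, row_e)]
        for row_c, row_h, row_e in zip(cur, hist_warped, fb_err)
    ]
```

With this rule a confident current prediction passes through unchanged, and an unreliable flow estimate switches the history term off entirely.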

Abstract:
Fine‑grained semantic segmentation of transmission‑corridor point clouds is fundamental for intelligent power‑line inspection. However, current progress is limited by realistic data scarcity and the difficulty of modeling global corridor structure and local geometric details in long, heterogeneous scenes. Existing public datasets usually provide only a few coarse categories or short cropped scenes which overlook long‑range structural dependencies, severe long‑tail distributions, and subtle distinctions among safety‑critical components. As a result, current methods are difficult to evaluate under realistic inspection settings, and their ability to preserve and integrate complementary global and local cues remains unclear. To address the above challenges, we introduce TowerDataset, a heterogeneous benchmark for transmission‑corridor segmentation. TowerDataset contains 661 real‑world scenes and about 2.466 billion points. It preserves long corridor extents, defines a fine‑grained 22‑class taxonomy, and provides standardized splits and evaluation protocols. In addition, we present a global‑local fusion framework which preserves and fuses whole‑scene and local‑detail information. A whole‑scene branch with NoCrop training and prototypical contrastive learning captures long‑range topology and contextual dependencies. A block‑wise local branch retains fine geometric structures. Both predictions are then fused and refined by geometric validation. This design allows the model to exploit both global relationships and local shape details when recognizing rare and confusing components. Experiments on TowerDataset and two public benchmarks demonstrate the challenge of the proposed benchmark and the robustness of our framework in real, complex, and heterogeneous transmission‑corridor scenes. The dataset will be released soon at https://huggingface.co/datasets/tccx18/Towerdataset/tree/main.

Abstract:
This paper proposes a novel privacy‑preserving semantic segmentation method that can use independent keys for each client and image. In the proposed method, the model creator and each client encrypt images using locally generated keys, and model training and inference are conducted on the encrypted images. To mitigate performance degradation, an image encryption method is applied to model training in addition to the generation of test images. In experiments, the effectiveness of the proposed method is confirmed on the Cityscapes dataset under the use of a vision transformer‑based model, called SETR.

Abstract:
Two‑dimensional materials are expected to play an important role in next‑generation electronics and optoelectronic devices. Recently, twisted bilayer graphene and transition metal dichalcogenides have attracted significant attention due to their unique physical properties and potential applications. In this study we describe the use of optical microscopy to collect the color space of chemical vapor deposition (CVD) molybdenum disulfide (MoS$_2$), and the application of a semantic segmentation convolutional neural network (CNN) to accurately and rapidly identify thicknesses of MoS$_2$ flakes. A second CNN model is trained to provide precise predictions on the twist angle of CVD‑grown bilayer flakes. This model harnessed a dataset comprising over 10,000 synthetic images, encompassing geometries spanning from hexagonal to triangular shapes. Subsequent validation of the deep learning predictions on twist angles was executed through the second harmonic generation and Raman spectroscopy. Our results introduce a scalable methodology for automated inspection of twisted atomically thin CVD‑grown bilayer.

Abstract:
Open‑vocabulary semantic segmentation enables models to segment objects or image regions beyond fixed class sets, offering flexibility in dynamic environments. However, existing methods often rely on single‑view images and struggle with spatial precision, especially under occlusions and near object boundaries. We propose SENSE, the first work on Stereo OpEN Vocabulary SEmantic Segmentation, which leverages stereo vision and vision‑language models to enhance open‑vocabulary semantic segmentation. By incorporating stereo image pairs, we introduce geometric cues that improve spatial reasoning and segmentation accuracy. Trained on the PhraseStereo dataset, our approach achieves strong performance in phrase‑grounded tasks and demonstrates generalization in zero‑shot settings. On PhraseStereo, we show a +2.9% improvement in Average Precision over the baseline method and +0.76% over the best competing method. SENSE also provides a relative improvement of +3.5% mIoU on Cityscapes and +18% on KITTI compared to the baseline work. By jointly reasoning over semantics and geometry, SENSE supports accurate scene understanding from natural language, essential for autonomous robots and Intelligent Transportation Systems.

Abstract:
Segmentation is a critical task in computational pathology, as it identifies areas affected by disease or abnormal growth and is essential for diagnosis and treatment. However, acquiring high‑quality pixel‑level supervised segmentation data places a significant workload on experienced pathologists, limiting the application of deep learning. To overcome this challenge, relaxing the label conditions to image‑level classification labels allows for more data to be used and more scenarios to be enabled. One approach is to leverage Class Activation Map (CAM) to generate pseudo pixel‑level annotations for semantic segmentation with only image‑level labels. However, this method fails to thoroughly explore the essential characteristics of pathology images, thus identifying only small areas that are insufficient for pseudo masking. In this paper, we propose a novel shuffle‑based feedback learning method inspired by curriculum learning to generate higher‑quality pseudo‑semantic segmentation masks. Specifically, we perform patch‑level shuffling of pathology images, with the model adaptively adjusting the shuffle strategy based on feedback from previous learning. Experimental results demonstrate that our proposed approach outperforms state‑of‑the‑art methods on three different datasets.

Abstract:
Tristructural isotropic (TRISO)‑coated particle fuels undergo dimensional changes and chemical reactions during high‑temperature neutron irradiation. Post‑irradiation materialography helps understand processes that impact fuel performance, such as coating integrity and fission product retention. Conventionally, experts manually evaluate features in thousands of cross sections of sub‑mm‑sized samples, which is tedious and subjective. In this work, we propose UA‑Net, a deep learning framework that segments five characteristic regions of TRISO fuel micrographs and generates an uncertainty map for predictions. The model uses a multi‑stage pretraining strategy, starting with general image representations learned from ImageNet, followed by fine‑tuning on TRISO micrographs from various irradiation experiments and AGR‑5/6/7 particle cross sections. A meta‑model for uncertainty prediction is integrated to identify small defects in TRISO images. UA‑Net was evaluated on a test set of 102 images, achieving mean Intersection over Union (mIoU) and mean Precision (mP) of 95.5% and 97.3%, respectively. The meta‑model achieved a specificity of 91.8% and sensitivity of 93.5%, demonstrating strong performance in detecting misclassifications. The model was also applied to new TRISO images for qualitative evaluation, showing high accuracy in extracting layer regions.

Abstract:
Grain‑edge segmentation (GES) and lithology semantic segmentation (LSS) are two pivotal tasks for quantifying rock fabric and composition. However, these two tasks are often treated separately, and segmentation quality remains unsatisfactory even though expensive, time‑consuming, expert‑annotated datasets have been used. Recently, foundation models, especially the Segment Anything Model (SAM), have demonstrated impressive robustness for boundary alignment. However, directly adapting SAM to joint GES and LSS is nontrivial due to 1) the severe domain gap induced by extinction‑dependent color variations and ultra‑fine grain boundaries, and 2) the lack of modules for joint learning on multi‑angle petrographic image stacks. In this paper, we propose Petro‑SAM, a novel two‑stage, multi‑task framework that can achieve high‑quality joint GES and LSS on petrographic images. Specifically, based on SAM, we introduce a Merge Block to integrate seven polarized views, effectively solving the extinction issue. Moreover, we introduce multi‑scale feature fusion and color‑entropy priors to refine the detection.

Abstract:
Myotubes are multinucleated muscle fibers serving as key model systems for studying muscle physiology, disease mechanisms, and drug responses. Mechanistic studies and drug screening thereby rely on quantitative morphological readouts such as diameter, length, and branching degree, which in turn require precise three‑dimensional instance segmentation. Yet established pretrained biomedical segmentation models fail to generalize to this domain due to the absence of large annotated myotube datasets. We introduce a geometry‑driven synthesis pipeline that models individual myotubes via polynomial centerlines, locally varying radii, branching structures, and ellipsoidal end caps derived from real microscopy observations. Synthetic volumes are rendered with realistic noise, optical artifacts, and CycleGAN‑based Domain Adaptation (DA). A compact 3D U‑Net with self‑supervised encoder pretraining, trained exclusively on synthetic data, achieves a mean IPQ of 0.22 on real data, significantly outperforming three established zero‑shot segmentation models, demonstrating that biophysics‑driven synthesis enables effective instance segmentation in annotation‑scarce biomedical domains.

Abstract:
Recent advances in unsupervised video object segmentation have highlighted the potential of two‑stream architectures that integrate appearance and motion cues. However, fully leveraging these complementary sources of information requires effectively modeling their interdependencies. In this paper, we introduce cross‑modality token modulation, a novel approach designed to strengthen the interaction between appearance and motion cues. Our method establishes dense connections between tokens from each modality, enabling efficient intra‑modal and inter‑modal information propagation through relation transformer blocks. To improve learning efficiency, we incorporate a token masking strategy that addresses the limitations of relying solely on increased model complexity. Our approach achieves state‑of‑the‑art performance across all public benchmarks, outperforming existing methods.

Abstract:
Accurate detection and segmentation of glomeruli in kidney tissue are essential for diagnostic applications. Traditional deep learning methods primarily rely on semantic segmentation, which often fails to precisely delineate adjacent glomeruli. To address this challenge, we propose a novel glomerulus detection and segmentation model that emphasises boundary separation. Leveraging pathology foundation models, the proposed U‑Net‑based architecture incorporates a specialised attention decoder designed to highlight critical regions and improve instance‑level segmentation. Experimental evaluations demonstrate that our approach surpasses state‑of‑the‑art methods in both Dice score and Intersection over Union, indicating superior performance in glomerular delineation.

Abstract:
We address the challenge of synthetic‑to‑real transfer in forestry perception where real data have only coarse Tree labels while synthetic data provide fine‑grained trunk/crown annotations. We introduce MGTD, a mixed‑granularity dataset with 53k synthetic and 3.6k real images, and a four‑stage protocol isolating domain shift and granularity mismatch. Our core contribution is granularity‑aware distillation, which transfers structural priors from fine‑grained synthetic teachers to a coarse‑label student via logit‑space merging and mask unification. Experiments show consistent mask AP gains, especially for small/distant trees, establishing a testbed for Sim‑Real transfer under label granularity constraints.

Abstract:
Tuning of gate‑defined semiconductor quantum dots (QDs) is a major bottleneck for scaling spin qubit technologies. We present a deep learning (DL) driven, semantic‑segmentation pipeline that performs charge auto‑tuning by locating transition lines in full charge stability diagrams (CSDs) and returns gate voltage targets for the single charge regime. We assemble and manually annotate a large, heterogeneous dataset of 1015 experimental CSDs measured from silicon QD devices, spanning nine design geometries, multiple wafers, and fabrication runs. A U‑Net style convolutional neural network (CNN) with a MobileNetV2 encoder is trained and validated through five‑fold group cross validation. Our model achieves an overall offline tuning success of 80.0% in locating the single‑charge regime, with peak performance exceeding 88% for some designs. We analyze dominant failure modes and propose targeted mitigations. Finally, wide‑range diagram segmentation also naturally enables scalable physics‑based feature extraction that can feed back to fabrication and design workflows, and we outline a roadmap for real‑time integration in a cryogenic wafer prober. Overall, our results show that neural network (NN) based wide‑diagram segmentation is a practical step toward automated, high‑throughput charge tuning for silicon QD qubits.

Abstract:
Semantic segmentation of histopathology images under class imbalance is typically addressed through frequency‑based loss reweighting, which implicitly assumes that rare classes are difficult. However, true difficulty also arises from morphological variability, boundary ambiguity, and contextual similarity, factors that frequency cannot capture. We propose Dynamic Focal Attention (DFA), a simple and efficient mechanism that learns class‑specific difficulty directly within the cross‑attention of query‑based mask decoders. DFA introduces a learnable per‑class bias to attention logits, enabling representation‑level reweighting prior to prediction rather than gradient‑level reweighting after prediction. Initialised from a log‑frequency prior to prevent gradient starvation, the bias is optimised end‑to‑end, allowing the model to adaptively capture difficulty signals through training, effectively unifying frequency‑based and difficulty‑aware approaches under a common attention‑bias framework. On three histopathology benchmarks (BDSA, BCSS, CRAG), DFA consistently improves Dice and IoU, matching or exceeding a difficulty‑aware baseline without a separate estimator or additional training stage. These results demonstrate that encoding class difficulty at the representation level provides a principled alternative to conventional loss reweighting for imbalanced segmentation.
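
The per‑class attention bias could look roughly like the sketch below. This is only an illustration: the sign and zero‑mean centring of the log‑frequency initialisation, and the way the bias is attached to the logits, are assumptions not stated in the abstract. In a real model the bias would be a learnable parameter updated by backpropagation rather than a fixed list.

```python
import math


def log_frequency_bias(class_freqs):
    """Initial per-class bias from class frequencies.

    Assumed form (hypothetical): negative log-frequency, centred to zero
    mean, so that rarer classes start with a larger bias.
    """
    raw = [-math.log(f) for f in class_freqs]
    mean = sum(raw) / len(raw)
    return [r - mean for r in raw]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention(logits, bias):
    """Add the per-class bias to the attention logits before normalising."""
    return softmax([l + b for l, b in zip(logits, bias)])
```

For two classes with frequencies 0.9 and 0.1 and equal raw logits, this initialisation shifts the attention weights towards the rare class before any training has taken place.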

Abstract:
Autonomous drone delivery systems are rapidly advancing, but ensuring safe and reliable package drop‑offs remains highly challenging in cluttered urban and suburban environments where accurately identifying suitable package drop zones is critical. Existing approaches typically rely on either geometry‑based analysis or semantic segmentation alone, but these methods lack the integrated semantic reasoning required for robust decision‑making. To address this gap, we propose See&Say, a novel framework that combines geometric safety cues with semantic perception, guided by a Vision‑Language Model (VLM) for iterative refinement. The system fuses monocular depth gradients with open‑vocabulary detection masks to produce safety maps, while the VLM dynamically adjusts object category prompts and refines hazard detection across time, enabling reliable reasoning under dynamic conditions during the final delivery phase. When the primary drop‑pad is occupied or unsafe, the proposed See&Say also identifies alternative candidate zones for package delivery. We curated a dataset of urban delivery scenarios with moving objects and human activities to evaluate the approach. Experimental results show that See&Say outperforms all baselines, achieving the highest accuracy and IoU for safety map prediction as well as superior performance in alternative drop zone evaluation across multiple thresholds. These findings highlight the promise of VLM‑guided segmentation‑depth fusion for advancing safe and practical drone‑based package delivery.

Abstract:
Vision‑language models have been key to the development of open‑vocabulary 2D semantic segmentation. Lifting these models from 2D images to 3D scenes, however, remains a challenging problem. Existing approaches typically back‑project and average 2D descriptors across views, or heuristically select a single representative one, often resulting in suboptimal 3D representations. In this work, we introduce a novel multiview transformer architecture that cross‑attends across vision‑language descriptors from multiple viewpoints and fuses them into a unified per‑3D‑instance embedding. As a second contribution, we leverage multiview consistency as a self‑supervision signal for this fusion, which significantly improves performance when added to a standard supervised target‑class loss. Our Cross‑Attentive Multiview Fusion, which we denote with its acronym CAMFusion, not only consistently outperforms naive averaging or single‑view descriptor selection, but also achieves state‑of‑the‑art results on 3D semantic and instance classification benchmarks, including zero‑shot evaluations on out‑of‑domain datasets.

Abstract:
The exponential growth of data from modern radio telescopes presents a significant challenge to traditional single‑pulse search algorithms, which are computationally intensive and prone to high false‑positive rates due to Radio Frequency Interference (RFI). In this work, we introduce FRTSearch, an end‑to‑end framework unifying the detection and physical characterization of Fast Radio Transients (FRTs). Leveraging the morphological universality of dispersive trajectories in time‑frequency dynamic spectra, we reframe FRT detection as a pattern recognition problem governed by the cold plasma dispersion relation. To facilitate this, we constructed CRAFTS‑FRT, a pixel‑level annotated dataset derived from the Commensal Radio Astronomy FAST Survey (CRAFTS), comprising 2,392 instances across diverse source classes. This dataset enables the training of a Mask R‑CNN model for precise trajectory segmentation. Coupled with our physics‑driven IMPIC algorithm, the framework maps the geometric coordinates of segmented trajectories to directly infer the Dispersion Measure (DM) and Time of Arrival (ToA). Benchmarking on the FAST‑FREX dataset shows that FRTSearch achieves a 98.0% recall, competitive with exhaustive search methods, while reducing false positives by over 99.9% compared to PRESTO and delivering a processing speedup of up to 13.9×. Furthermore, the framework demonstrates robust cross‑facility generalization, detecting all 19 tested FRBs from the ASKAP survey without retraining. By shifting the paradigm from "search‑then‑identify" to "detect‑and‑infer," FRTSearch provides a scalable, high‑precision solution for real‑time discovery in the era of petabyte‑scale radio astronomy.
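
To illustrate how trajectory coordinates relate to DM and ToA, the sketch below uses the standard cold‑plasma dispersion delay, t(ν) = t∞ + k_DM · DM · ν⁻², and a plain least‑squares line fit in ν⁻². This is not the authors' IMPIC algorithm, whose details are not given in the abstract; the function names are hypothetical, and the ToA here is referenced to infinite frequency by assumption.

```python
# Dispersion constant in ms * GHz^2 * cm^3 / pc (standard pulsar-astronomy value).
K_DM_MS = 4.148808


def dispersion_delay_ms(dm, freq_ghz, ref_freq_ghz):
    """Arrival delay (ms) at freq_ghz relative to ref_freq_ghz for a given DM (pc/cm^3)."""
    return K_DM_MS * dm * (freq_ghz ** -2 - ref_freq_ghz ** -2)


def infer_dm_and_toa(points):
    """Fit t = toa + K_DM_MS * DM * nu^-2 to (freq_ghz, time_ms) trajectory points.

    Ordinary least squares on the regressor x = K_DM_MS * nu^-2, so the
    slope is the DM and the intercept is the infinite-frequency ToA.
    Needs at least two points at distinct frequencies.
    """
    xs = [K_DM_MS * f ** -2 for f, _ in points]
    ts = [t for _, t in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_t = sum(ts) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxt = sum((x - mean_x) * (t - mean_t) for x, t in zip(xs, ts))
    dm = sxt / sxx
    toa = mean_t - dm * mean_x
    return dm, toa
```

In practice the (frequency, time) points would come from the pixels of a segmented trajectory mask after converting pixel indices to physical units.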

Abstract:
Training multimodal large language models (MLLMs) for video understanding requires large‑scale annotated data spanning diverse tasks such as object counting, question answering, and segmentation. However, collecting and annotating multimodal video data in the real world is costly, slow, and inherently limited in diversity and coverage. To address this challenge, we propose a unified synthetic data generation pipeline capable of automatically producing unlimited multimodal video data with rich and diverse supervision. Our framework supports multiple task formats within a single pipeline, enabling scalable and consistent data creation across tasks. To further enhance reasoning ability, we introduce a VQA‑based fine‑tuning strategy that trains models to answer structured questions about visual content rather than relying solely on captions or simple instructions. This formulation encourages deeper visual grounding and reasoning. We evaluate our approach in three challenging tasks: video object counting, video‑based visual question answering, and video object segmentation. Experimental results demonstrate that models trained predominantly on synthetic data generalize effectively to real‑world datasets, often outperforming traditionally trained counterparts. Our findings highlight the potential of unified synthetic data pipelines as a scalable alternative to expensive real‑world annotation for multimodal video understanding.

Abstract:
LiDAR semantic segmentation plays a pivotal role in 3D scene understanding for edge applications such as autonomous driving. However, significant challenges remain for real‑world deployments, particularly for on‑device post‑deployment adaptation. Real‑world environments can shift as the system navigates through different locations, leading to substantial performance degradation without effective and timely model adaptation. Furthermore, edge systems operate under strict computational and energy constraints, making it infeasible to adapt conventional segmentation models (based on large neural networks) directly on‑device. To address the above challenges, we introduce HyperLiDAR, the first lightweight, post‑deployment LiDAR segmentation framework based on Hyperdimensional Computing (HDC). The design of HyperLiDAR fully leverages the fast learning and high efficiency of HDC, inspired by how the human brain processes information. To further improve the adaptation efficiency, we identify the high data volume per scan as a key bottleneck and introduce a buffer selection strategy that focuses learning on the most informative points. We conduct extensive evaluations on two state‑of‑the‑art LiDAR segmentation benchmarks and two representative devices. Our results show that HyperLiDAR outperforms or achieves comparable adaptation performance to state‑of‑the‑art segmentation methods, while achieving up to a 13.8x speedup in retraining.

Abstract:
Multimodal semantic segmentation has emerged as a powerful paradigm for enhancing scene understanding by leveraging complementary information from multiple sensing modalities (e.g., RGB, depth, and thermal). However, existing cross‑modal fusion methods often implicitly assume that all modalities are equally reliable, which can lead to feature degradation when auxiliary modalities are noisy, misaligned, or incomplete. In this paper, we revisit cross‑modal fusion from the perspective of modality reliability and propose a novel framework termed the Reliability‑aware Self‑Gated State Space Model (RSGMamba). At the core of our method is the Reliability‑aware Self‑Gated Mamba Block (RSGMB), which explicitly models modality reliability and dynamically regulates cross‑modal interactions through a self‑gating mechanism. Unlike conventional fusion strategies that indiscriminately exchange information across modalities, RSGMB enables reliability‑aware feature selection and enhances informative feature aggregation. In addition, a lightweight Local Cross‑Gated Modulation (LCGM) is incorporated to refine fine‑grained spatial details, complementing the global modeling capability of RSGMB. Extensive experiments demonstrate that RSGMamba achieves state‑of‑the‑art performance on both RGB‑D and RGB‑T semantic segmentation benchmarks, reaching 58.8% / 54.0% mIoU on NYUDepth V2 and SUN‑RGBD (+0.4% / +0.7% over prior best), and 61.1% / 88.9% mIoU on MFNet and PST900 (up to +1.6%), with only 48.6M parameters, thereby validating the effectiveness and superiority of the proposed approach.

Abstract:
Existing cell instance segmentation pipelines typically combine deterministic predictions with post‑processing, which imposes limited explicit constraints on the global structure of instance masks. In this work, we propose a multi‑task image‑to‑image Schrödinger Bridge framework that formulates instance segmentation as a distribution‑based image‑to‑image generation problem. Boundary‑aware supervision is integrated through a reverse distance map, and deterministic inference is employed to produce stable predictions. Experimental results on the PanNuke dataset demonstrate that the proposed method achieves competitive or superior performance without relying on SAM pre‑training or additional post‑processing. Additional results on the MoNuSeg dataset show robustness under limited training data. These findings indicate that Schrödinger Bridge‑based image‑to‑image generation provides an effective framework for cell instance segmentation.

Abstract:
Multimodal perception systems for robotics and embodied AI often assume reliable RGB‑D sensing, but in practice, depth is frequently missing, noisy, or corrupted. We thus present GeomPrompt, a lightweight cross‑modal adaptation module that synthesizes a task‑driven geometric prompt from RGB alone for the fourth channel of a frozen RGB‑D semantic segmentation model, without depth supervision. We further introduce GeomPrompt‑Recovery, an adaptation module that compensates for degraded depth by predicting the fourth channel correction relevant for the frozen segmenter. Both modules are trained solely with downstream segmentation supervision, enabling recovery of the geometric prior useful for segmentation, rather than estimating depth signals. On SUN RGB‑D, GeomPrompt improves over RGB‑only inference by +6.1 mIoU on DFormer and +3.0 mIoU on GeminiFusion, while remaining competitive with strong monocular depth estimators. For degraded depth, GeomPrompt‑Recovery consistently improves robustness, yielding gains up to +3.6 mIoU under severe depth corruptions. GeomPrompt is also substantially more efficient than monocular depth baselines, reaching 7.8 ms latency versus 38.3 ms and 71.9 ms. These results suggest that task‑driven geometric prompting is an efficient mechanism for cross‑modal compensation under missing and degraded depth inputs in RGB‑D perception.

Abstract:
Reasoning video object segmentation predicts pixel‑level masks in videos from natural‑language queries that may involve implicit and temporally grounded references. However, existing methods are developed and evaluated in an offline regime, where the entire video is available at inference time and future frames can be exploited for retrospective disambiguation, deviating from real‑world deployments that require strictly causal, frame‑by‑frame decisions. We study Online Reasoning Video Object Segmentation (ORVOS), where models must incrementally interpret queries using only past and current frames without revisiting previous predictions, while handling referent shifts as events unfold. To support evaluation, we introduce ORVOSB, a benchmark with frame‑level causal annotations and referent‑shift labels, comprising 210 videos, 12,907 annotated frames, and 512 queries across five reasoning categories. We further propose a baseline with continually‑updated segmentation prompts and a structured temporal token reservoir for long‑horizon reasoning under bounded computation. Experiments show that existing methods struggle under strict causality and referent shifts, while our baseline establishes a strong foundation for future research.

Abstract:
Semantic segmentation requires dense pixel‑level annotations, which are costly and time‑consuming to acquire. To address this, we present SeSAM, a framework that uses a foundational segmentation model, i.e. Segment Anything Model (SAM), with weak labels, including coarse masks, scribbles, and points. SAM, originally designed for instance‑based segmentation, cannot be directly used for semantic segmentation tasks. In this work, we identify specific challenges faced by SAM and determine appropriate components to adapt it for class‑based segmentation using weak labels. Specifically, SeSAM decomposes class masks into connected components, samples point prompts along object skeletons, selects SAM masks using weak‑label coverage, and iteratively refines labels using pseudo‑labels, enabling SAM‑generated masks to be effectively used for semantic segmentation. Integrated with a semi‑supervised learning framework, SeSAM balances ground‑truth labels, SAM‑based pseudo‑labels, and high‑confidence pseudo‑labels, significantly improving segmentation quality. Extensive experiments across multiple benchmarks and weak annotation types show that SeSAM consistently outperforms weakly supervised baselines while substantially reducing annotation cost relative to fine supervision.
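
The first two steps described above (splitting a class mask into connected components and choosing one point prompt per component) can be sketched as follows. Note the substitution: the abstract samples points along object skeletons, whereas this sketch picks the component pixel closest to the centroid as a simpler stand‑in. The function names and the choice of 4‑connectivity are assumptions.

```python
from collections import deque


def connected_components(mask):
    """Return the 4-connected components of a binary mask (list of lists of 0/1).

    Each component is a list of (row, col) pixel coordinates.
    """
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    components = []
    for r in range(h):
        for c in range(w):
            if mask[r][c] and not seen[r][c]:
                # Breadth-first flood fill from this unvisited foreground pixel.
                queue = deque([(r, c)])
                seen[r][c] = True
                comp = []
                while queue:
                    y, x = queue.popleft()
                    comp.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                components.append(comp)
    return components


def interior_point(component):
    """Pick the component pixel nearest its centroid (stand-in for skeleton sampling)."""
    cy = sum(y for y, _ in component) / len(component)
    cx = sum(x for _, x in component) / len(component)
    return min(component, key=lambda p: (p[0] - cy) ** 2 + (p[1] - cx) ** 2)
```

Because the chosen point is always a pixel of the component, it is guaranteed to lie inside the weak label, which a raw centroid would not be for concave or ring‑shaped regions.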

Abstract:
Semantic segmentation of 3D point cloud scenes is a crucial task for various applications. In real‑world scenarios, training segmentation models often faces three concurrent forms of data insufficiency: scarcity of training scenes, scarcity of point‑level annotations, and absence of 2D image sequences from which point clouds were reconstructed. Existing data‑efficient algorithms typically address only one or two of these challenges, leaving the joint treatment of all three unexplored. This paper proposes a data‑efficient training framework specifically designed to address the three forms of data insufficiency. Our proposed algorithm, called Point pseudo‑Labeling via Open‑Vocabulary Image Segmentation (PLOVIS), leverages an Open‑Vocabulary Image Segmentation (OVIS) model as a pseudo label generator to compensate for the lack of training data. PLOVIS creates 2D images for pseudo‑labeling directly from training 3D point clouds, eliminating the need for 2D image sequences. To mitigate the inherent noise and class imbalance in pseudo labels, we introduce a two‑stage filtering of pseudo labels combined with a class‑balanced memory bank for effective training. The two‑stage filtering mechanism first removes low‑confidence pseudo labels, then discards likely incorrect pseudo labels, thereby enhancing the quality of pseudo labels. Experiments on four benchmark datasets, i.e., ScanNet, S3DIS, Toronto3D, and Semantic3D, under realistic data‑scarce conditions (a few tens of training 3D scenes, each annotated with fewer than 100 3D points) demonstrate that PLOVIS consistently outperforms existing methods including standard fine‑tuning strategies and state‑of‑the‑art weakly supervised learning algorithms. Code will be made publicly available.

Abstract:
Fully supervised Video Semantic Segmentation (VSS) relies heavily on densely annotated video data, limiting practical applicability. Alternatively, applying pre‑trained Image Semantic Segmentation (ISS) models frame‑by‑frame avoids annotation costs but ignores crucial temporal coherence. Recent foundation models such as SAM2 enable high‑quality mask propagation yet remain impractical for direct VSS due to limited semantic understanding and computational overhead. In this paper, we propose DiTTA (Distillation‑assisted Test‑Time Adaptation), a novel framework that converts an ISS model into a temporally‑aware VSS model through efficient test‑time adaptation (TTA), without annotated videos. DiTTA distills SAM2's temporal segmentation knowledge into the ISS model during a brief, single‑pass initialization phase, complemented by a lightweight temporal fusion module to aggregate cross‑frame context. Crucially, DiTTA achieves robust generalization even when adapting with highly limited partial video snippets (e.g., initial 10%), significantly outperforming zero‑shot refinement approaches that repeatedly invoke SAM2 during inference. Extensive experiments on VSPW and Cityscapes demonstrate DiTTA's effectiveness, achieving competitive or superior performance relative to fully‑supervised VSS methods, thus providing a practical and annotation‑free solution for real‑world VSS tasks.

Abstract:
Instance segmentation enables the analysis of spatial and temporal properties of cells in microscopy images by identifying the pixels belonging to each cell. However, progress is constrained by the scarcity of high‑quality labeled microscopy datasets. Many recent approaches address this challenge by initializing models with segmentation‑pretrained weights from large‑scale natural‑image models such as Segment Anything Model (SAM). However, representations learned from natural images often encode objectness and texture priors that are poorly aligned with microscopy data, leading to degraded performance under domain shift. We propose DINOCell, a self‑supervised framework for cell instance segmentation that leverages representations from DINOv2 and adapts them to microscopy through continued self‑supervised training on unlabeled cell images prior to supervised fine‑tuning. On the LIVECell benchmark, DINOCell achieves a SEG score of 0.784, improving by 10.42% over leading SAM‑based models, and demonstrates strong zero‑shot performance on three out‑of‑distribution microscopy datasets. These results highlight the benefits of domain‑adapted self‑supervised pretraining for robust cell segmentation.

Abstract:
Precise segmentation of objects with highly similar shapes remains a challenging problem in dense prediction, especially in scenarios with ambiguous boundaries, overlapping instances, and weak inter‑instance visual differences. While conventional segmentation models are effective at localizing object regions, they often lack the discriminative capacity required to reliably distinguish a target object from morphologically similar distractors. In this work, we study fine‑grained object segmentation from an identity‑aware perspective and propose Identity‑Aware U‑Net (IAU‑Net), a unified framework that jointly models spatial localization and instance discrimination. Built upon a U‑Net‑style encoder‑decoder architecture, our method augments the segmentation backbone with an auxiliary embedding branch that learns discriminative identity representations from high‑level features, while the main branch predicts pixel‑accurate masks. To enhance robustness in distinguishing objects with near‑identical contours or textures, we further incorporate triplet‑based metric learning, which pulls target‑consistent embeddings together and separates them from hard negatives with similar morphology. This design enables the model to move beyond category‑level segmentation and acquire a stronger capability for precise discrimination among visually similar objects. Experiments on benchmarks including cell segmentation demonstrate promising results, particularly in challenging cases involving similar contours, dense layouts, and ambiguous boundaries.
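
A minimal sketch of a triplet margin loss, the generic form of the triplet-based metric learning mentioned above. The Euclidean distance and the margin value of 0.2 are assumptions for illustration; the abstract does not state which distance or margin IAU-Net uses.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the negative is farther from the anchor than the
    positive by at least `margin` (a hypothetical default here).
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)
```

A hard negative, i.e. an embedding of a morphologically similar distractor that lies close to the anchor, yields a positive loss, which is what pushes the two identities apart during training.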

Abstract:
Detecting unseen anomalies in unstructured environments presents a critical challenge for industrial and agricultural applications such as material recycling and weeding. Existing perception systems frequently fail to satisfy the strict operational requirements of these domains, specifically real‑time processing, pixel‑level segmentation precision, and robust accuracy, due to their reliance on exhaustively annotated datasets. To address these limitations, we propose a pipeline for object segmentation and classification using weak image‑level supervision, called 'Patch Aggregation for Segmentation of Targets and Anomalies' (PASTA). By comparing an observed scene with a nominal reference, PASTA identifies Target and Anomaly objects through distribution analysis in self‑supervised Vision Transformer (ViT) feature spaces. Our pipeline utilizes semantic text prompts via the Segment Anything Model 3 to guide zero‑shot object segmentation. Evaluations on a custom steel scrap recycling dataset and a plant dataset demonstrate that our approach reduces training time by 75.8% relative to domain‑specific baselines. While being domain‑agnostic, our method achieves superior Target (up to 88.3% IoU) and Anomaly (up to 63.5% IoU) segmentation performance in the industrial and agricultural domains.

Abstract:
Just Recognizable Difference (JRD) boosts coding efficiency for machine vision through visibility threshold modeling, but is currently limited to a single‑task scenario. To address this issue, we propose a Multi‑Task JRD (MT‑JRD) dataset and an Attribute‑assisted MT‑JRD (AMT‑JRD) model for Video Coding for Machines (VCM), enhancing both prediction accuracy and coding efficiency. First, we construct a dataset comprising 27,264 JRD annotations from machines, supporting three representative tasks: object detection, instance segmentation, and keypoint detection. Second, we propose the AMT‑JRD prediction model, which integrates a Generalized Feature Extraction Module (GFEM) and a Specialized Feature Extraction Module (SFEM) to facilitate joint learning across multiple tasks. Third, we incorporate object attribute information into object‑wise JRD prediction through the Attribute Feature Fusion Module (AFFM), which introduces prior knowledge about object size and location. This design compensates for the limitations of relying solely on image features and enhances the model's capacity to represent the perceptual mechanisms of machine vision. Finally, we apply the AMT‑JRD model to VCM, where the accurately predicted JRDs are used to reduce the coding bit rate while preserving accuracy across multiple machine vision tasks. Extensive experimental results demonstrate that AMT‑JRD achieves precise and robust multi‑task prediction with a mean absolute error of 3.781 and error variance of 5.332 across three tasks, outperforming the state‑of‑the‑art single‑task prediction model by 6.7% and 6.3%, respectively. Coding experiments further reveal that, compared to the baseline VVC and JPEG, the AMT‑JRD‑based VCM improves Bjontegaard Delta‑mean Average Precision (BD‑mAP) by an average of 3.861% and 7.886%, respectively.

Abstract:
Accurate segmentation of surgical instruments in robotic‑assisted surgery is critical for enabling context‑aware computer‑assisted interventions, such as tool tracking, workflow analysis, and autonomous decision‑making. In this study, we benchmark five deep learning architectures (UNet, UNet, DeepLabV3, Attention UNet, and SegFormer) on the SAR‑RARP50 dataset for multi‑class semantic segmentation of surgical instruments in real‑world radical prostatectomy videos. The models are trained with a compound loss function combining Cross Entropy and Dice loss to address class imbalance and capture fine object boundaries. Our experiments reveal that while convolutional models such as UNet and Attention UNet provide strong baseline performance, DeepLabV3 achieves results comparable to SegFormer, demonstrating the effectiveness of atrous convolution and multi‑scale context aggregation in capturing complex surgical scenes. Transformer‑based architectures like SegFormer further enhance global contextual understanding, leading to improved generalization across varying instrument appearances and surgical conditions. This work provides a comprehensive comparison and practical insights for selecting segmentation models in surgical AI applications, highlighting the trade‑offs between convolutional and transformer‑based approaches.
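
The compound loss described above (Cross Entropy plus Dice) can be sketched in plain Python as below. The equal weighting of the two terms, the soft-Dice formulation averaged over classes, and the flat per-pixel input layout are assumptions for this sketch; the abstract does not specify them.

```python
import math


def cross_entropy(probs, labels, eps=1e-7):
    """Mean negative log-likelihood; probs[i][c] is the predicted probability
    of class c at pixel i, labels[i] is the ground-truth class index."""
    return -sum(math.log(max(p[y], eps)) for p, y in zip(probs, labels)) / len(labels)


def dice_loss(probs, labels, num_classes, eps=1e-7):
    """Soft Dice loss averaged over classes (1 - Dice coefficient)."""
    total = 0.0
    for c in range(num_classes):
        intersection = sum(p[c] for p, y in zip(probs, labels) if y == c)
        denominator = sum(p[c] for p in probs) + sum(1 for y in labels if y == c)
        total += 1.0 - (2.0 * intersection + eps) / (denominator + eps)
    return total / num_classes


def compound_loss(probs, labels, num_classes, ce_weight=1.0, dice_weight=1.0):
    """Weighted sum of the two terms; the weights are hypothetical defaults."""
    return (ce_weight * cross_entropy(probs, labels)
            + dice_weight * dice_loss(probs, labels, num_classes))
```

The Dice term is computed per class and then averaged, so a rare class contributes as much as a frequent one. This is the usual motivation for pairing it with Cross Entropy under class imbalance.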

Abstract:
Semantic segmentation of satellite imagery plays a vital role in land cover mapping and environmental monitoring. However, annotating large‑scale, high‑resolution satellite datasets is costly and time‑consuming, especially when covering vast geographic regions. Instead of randomly labeling data or exhaustively annotating entire datasets, Active Learning (AL) offers an efficient alternative by selecting the most informative samples for annotation with a human in the loop (HITL), thereby reducing labeling costs while maintaining high model performance. AL is particularly beneficial for large‑scale or resource‑constrained satellite applications, as it enables high segmentation accuracy with significantly fewer labeled samples. Despite these advantages, standard AL strategies typically rely on global uncertainty or diversity measures and lack the adaptability to target underperforming or rare classes as training progresses, which biases the resulting model. To overcome these limitations, we propose a novel adaptive acquisition function, Dynamic Class‑Aware Uncertainty based Active Learning (DCAU‑AL), that prioritizes sample selection based on real‑time class‑wise performance gaps, thereby mitigating class imbalance. The proposed DCAU‑AL mechanism continuously tracks per‑class segmentation performance and dynamically adjusts the sampling weights to focus on poorly performing or underrepresented classes throughout the active learning process. Extensive experiments on the OpenEarth land cover dataset show that DCAU‑AL significantly outperforms existing AL methods, especially under severe class imbalance, delivering superior per‑class IoU and improved annotation efficiency.
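
One way to read the class-aware acquisition idea above is sketched below: classes with a larger performance gap receive larger sampling weights, and each unlabeled sample is scored by its weighted per-class uncertainty. The gap definition (1 - IoU), the normalization, and all function and class names are assumptions for illustration, not the DCAU-AL formulation itself.

```python
def class_weights(per_class_iou, floor=1e-6):
    """Turn per-class IoU into normalized weights that grow with the gap 1 - IoU."""
    gaps = {c: max(1.0 - iou, floor) for c, iou in per_class_iou.items()}
    total = sum(gaps.values())
    return {c: g / total for c, g in gaps.items()}


def acquisition_score(class_uncertainty, weights):
    """Weighted sum of a sample's per-class uncertainty values."""
    return sum(weights.get(c, 0.0) * u for c, u in class_uncertainty.items())


def select_samples(pool, per_class_iou, budget):
    """Rank unlabeled samples by score and return the top `budget` names.

    `pool` maps a sample name to a dict of {class: uncertainty}.
    Weights are recomputed from the current per-class IoU on every call,
    which is what makes the selection adapt as training progresses.
    """
    weights = class_weights(per_class_iou)
    ranked = sorted(pool, key=lambda name: acquisition_score(pool[name], weights),
                    reverse=True)
    return ranked[:budget]
```

With hypothetical values such as an IoU of 0.9 for water and 0.2 for bareland, a tile whose uncertainty is concentrated on bareland outranks a more uncertain tile dominated by water.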

Abstract:
Conventional 3D instance segmentation methods rely on labor‑intensive 3D annotations for supervised training, which limits their scalability and generalization to novel objects. Recent approaches leverage multi‑view 2D masks from the Segment Anything Model (SAM) to guide the merging of 3D geometric primitives, thereby enabling zero‑shot 3D instance segmentation. However, these methods typically process each frame independently and rely solely on 2D metrics, such as SAM prediction scores, to produce segmentation maps. This design overlooks multi‑view correlations and inherent 3D priors, leading to inconsistent 2D masks across views and ultimately fragmented 3D segmentation. In this paper, we propose MV3DIS, a coarse‑to‑fine framework for zero‑shot 3D instance segmentation that explicitly incorporates 3D priors. Specifically, we introduce a 3D‑guided mask matching strategy that uses coarse 3D segments as a common reference to match 2D masks across views and consolidates multi‑view mask consistency via 3D coverage distributions. Guided by these view‑consistent 2D masks, the coarse 3D segments are further refined into precise 3D instances. Additionally, we introduce a depth consistency weighting scheme that quantifies projection reliability to suppress ambiguities from inter‑object occlusions, thereby improving the robustness of 3D‑to‑2D correspondence. Extensive experiments on the ScanNetV2, ScanNet200, ScanNet++, Replica, and Matterport3D datasets demonstrate the effectiveness of MV3DIS, which achieves superior performance over previous methods.

Abstract:
Training‑free open‑vocabulary semantic segmentation (TF‑OVSS) has recently attracted attention for its ability to perform dense prediction by leveraging the pretrained knowledge of large vision and vision‑language models, without requiring additional training. However, due to the limited input resolution of these pretrained encoders, existing TF‑OVSS methods commonly adopt a sliding‑window strategy that processes cropped sub‑images independently. While effective for managing high‑resolution inputs, this approach prevents global attention over the full image, leading to fragmented feature representations and limited contextual reasoning. We propose OV‑Stitcher, a training‑free framework that addresses this limitation by stitching fragmented sub‑image features directly within the final encoder block. By reconstructing attention representations from fragmented sub‑image features, OV‑Stitcher enables global attention within the final encoder block, producing coherent context aggregation and spatially consistent, semantically aligned segmentation maps. Extensive evaluations across eight benchmarks demonstrate that OV‑Stitcher establishes a scalable and effective solution for open‑vocabulary segmentation, achieving a notable improvement in mean Intersection over Union (mIoU) from 48.7 to 50.7 compared with prior training‑free baselines.

Abstract:
360° video object segmentation (360VOS) aims to predict temporally consistent masks in 360° videos, offering full‑scene coverage that benefits applications such as VR/AR and embodied AI. Learning a 360VOS model is nontrivial due to the lack of high‑quality labeled datasets. Recently, Segment Anything Models (SAMs), especially SAM2 with its memory module, have shown strong promptable VOS capability. However, directly using SAM2 for 360VOS yields implausible results, as 360° videos suffer from projection distortion, semantic inconsistency between the left and right sides, and sparse object mask information in SAM2's memory. To this end, we propose PanoSAM2, a novel 360VOS framework that adapts SAM2 with lightweight distortion‑ and memory‑aware strategies to achieve reliable 360VOS while retaining SAM2's user‑friendly prompting design. Concretely, to tackle the projection distortion and semantic inconsistency issues, we propose a Pano‑Aware Decoder with seam‑consistent receptive fields and iterative distortion refinement to maintain continuity across the 0/360 degree boundary. Meanwhile, a Distortion‑Guided Mask Loss is introduced to weight pixels by distortion magnitude, stressing stretched regions and boundaries. To address the object sparsity issue, we propose a Long‑Short Memory Module that maintains a compact long‑term object pointer to re‑instantiate and align short‑term memories, thereby enhancing temporal coherence. Extensive experiments show that PanoSAM2 yields substantial gains over SAM2: +5.6 on 360VOTS and +6.7 on PanoVOS, showing the effectiveness of our method.

Abstract:
Open‑vocabulary semantic segmentation (OVSS) aims to segment arbitrary category regions in images using open‑vocabulary prompts, necessitating that existing methods possess pixel‑level vision‑language alignment capability. Typically, this capability involves computing the cosine similarity, i.e., logits, between visual and linguistic features, and minimizing the distribution discrepancy between the logits and the ground truth (GT) to generate optimal logits that are subsequently used to construct segmentation maps, yet it depends on time‑consuming iterative training or model‑specific attention modulation. In this work, we propose a more direct approach that eschews the logits‑optimization process by directly deriving an analytic solution for the segmentation map. We posit a key hypothesis: the distribution discrepancy encodes semantic information; specifically, this discrepancy exhibits consistency across patches belonging to the same category but inconsistency across different categories. Based on this hypothesis, we directly utilize the analytic solution of this distribution discrepancy as the semantic maps. In other words, we reformulate the optimization of the distribution discrepancy as deriving its analytic solution, thereby eliminating time‑consuming iterative training, freeing us from model‑specific attention modulation, and achieving state‑of‑the‑art performance on eight benchmark datasets.
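
For context, the conventional step the abstract starts from, computing cosine-similarity logits between patch features and text features and taking the best-matching category per patch, can be sketched as follows. This illustrates only that baseline, not the analytic solution proposed in the paper; the function names and the toy feature layout are assumptions.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def cosine_logits(patch_feats, text_feats):
    """Return logits[i][k] = cosine similarity of patch i and category prompt k."""
    patches = [_normalize(p) for p in patch_feats]
    texts = [_normalize(t) for t in text_feats]
    return [[sum(a * b for a, b in zip(p, t)) for t in texts] for p in patches]


def segment(patch_feats, text_feats):
    """Assign each patch the category index with the highest cosine logit."""
    logits = cosine_logits(patch_feats, text_feats)
    return [max(range(len(row)), key=row.__getitem__) for row in logits]
```

In practice the per-patch assignments would be reshaped to the patch grid and upsampled to image resolution to form the segmentation map.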

Abstract:
In semantic segmentation, the creation of pixel‑level labels for training data incurs significant costs. To address this problem, semi‑supervised learning, which utilizes a small number of labeled images alongside unlabeled images to enhance the performance, has gained attention. A conventional semi‑supervised learning method, ClassMix, pastes class labels predicted from unlabeled images onto other images. However, since ClassMix performs operations using pseudo‑labels obtained from unlabeled images, there is a risk of handling inaccurate labels. Additionally, there is a gap in data quality between labeled and unlabeled images, which can impact the feature maps. This study addresses these two issues. First, we propose a method where class labels from labeled images, along with the corresponding image regions, are pasted onto unlabeled images and their pseudo‑labeled images. Second, we introduce a method that trains the model to make predictions on unlabeled images more similar to those on labeled images. Experiments on the Chase and COVID‑19 datasets demonstrated an average improvement of 2.07% in mIoU compared to conventional semi‑supervised learning methods.
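
The first proposal above, pasting a class region from a labeled image (together with its ground-truth labels) onto an unlabeled image and its pseudo-label map, can be sketched as below. Images and label maps are plain nested lists of equal size, and the function name `paste_class_region` is hypothetical. How the paper chooses which classes to paste is not stated in the abstract and is left to the caller here.

```python
def paste_class_region(labeled_img, labeled_mask, unlabeled_img, pseudo_mask, cls):
    """Copy all pixels of class `cls` from the labeled pair onto copies of the
    unlabeled image and its pseudo-label map; the inputs are not modified."""
    mixed_img = [row[:] for row in unlabeled_img]
    mixed_mask = [row[:] for row in pseudo_mask]
    for r, label_row in enumerate(labeled_mask):
        for c, label in enumerate(label_row):
            if label == cls:
                # Pasted pixels carry ground-truth labels, not pseudo-labels.
                mixed_img[r][c] = labeled_img[r][c]
                mixed_mask[r][c] = label
    return mixed_img, mixed_mask
```

Because the pasted region carries ground-truth labels, at least part of every mixed training target is known to be correct, which is the stated motivation for mixing from labeled rather than unlabeled images.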

Abstract:
Semi‑supervised learning for LiDAR semantic segmentation often suffers from error propagation and confirmation bias caused by noisy pseudo‑labels. To tackle this chronic issue, we introduce RePL, a novel framework that enhances pseudo‑label quality by identifying and correcting potential errors in pseudo‑labels through masked reconstruction, along with a dedicated training strategy. We also provide a theoretical analysis demonstrating the condition under which the pseudo‑label refinement is beneficial, and empirically confirm that the condition is mild and clearly met by RePL. Extensive evaluations on the nuScenes‑lidarseg and SemanticKITTI datasets show that RePL substantially improves pseudo‑label quality and, as a result, achieves the state of the art in LiDAR semantic segmentation.

Abstract:
Remote sensing semantic segmentation requires models that can jointly capture fine spatial details and high‑level semantic context across complex scenes. While classical encoder‑decoder architectures such as U‑Net remain strong baselines, they often struggle to fully exploit global semantics and structured feature interactions. In this work, we propose HQF‑Net, a hybrid quantum‑classical multi‑scale fusion network for remote sensing image segmentation. HQF‑Net integrates multi‑scale semantic guidance from a frozen DINOv3 ViT‑L/16 backbone with a customized U‑Net architecture through a Deformable Multiscale Cross‑Attention Fusion (DMCAF) module. To enhance feature refinement, the framework further introduces quantum‑enhanced skip connections (QSkip) and a Quantum bottleneck with Mixture‑of‑Experts (QMoE), which combines complementary local, global, and directional quantum circuits within an adaptive routing mechanism. Experiments on three remote sensing benchmarks show consistent improvements with the proposed design. HQF‑Net achieves 0.8568 mIoU and 96.87% overall accuracy on LandCover.ai, 71.82% mIoU on OpenEarthMap, and 55.28% mIoU with 99.37% overall accuracy on SeasoNet. An architectural ablation study further confirms the contribution of each major component. These results show that structured hybrid quantum‑classical feature processing is a promising direction for improving remote sensing semantic segmentation under near‑term quantum constraints.

Abstract:
Decreasing sequence length is a common way to accelerate transformers, but prior token reduction work often targets classification and reports proxy metrics rather than end‑to‑end latency. For semantic segmentation, token reduction is further constrained by the need to reconstruct dense, pixel‑aligned features, and on modern accelerators the overhead of computing merge maps can erase expected gains. We propose Mutual Pair Merging (MPM), a training‑free token aggregation module that forms mutual nearest‑neighbor pairs in cosine space, averages each pair, and records a merge map enabling a gather‑based reconstruction before the decoder so that existing segmentation heads can be used unchanged. MPM introduces no learned parameters and no continuous compression knob (no keep‑rate or threshold). The speed‑accuracy trade‑off is set by a discrete insertion schedule. We benchmark end‑to‑end latency on an NVIDIA H100 GPU (with and without FlashAttention‑2) and a Raspberry Pi 5 across standard segmentation datasets. On ADE20K, MPM reduces per‑image latency by up to 60% for ViT‑Tiny on Raspberry Pi 5, and increases throughput by up to 20% on H100 with FlashAttention‑2 while keeping the mIoU drop below 3%. These results suggest that simple, reconstruction‑aware, training‑free token merging can translate into practical wall‑clock gains for segmentation when overhead is explicitly accounted for.
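
The merge-and-reconstruct idea described above can be sketched in plain Python: find mutual nearest neighbors under cosine similarity, average each mutual pair, record a merge map, and gather from the merged tokens to restore the original sequence length. This is a brute-force illustration under assumed names (`mutual_pair_merge`, `gather_reconstruct`), not the authors' implementation, and it ignores the insertion schedule and any accelerator-specific optimizations.

```python
import math


def _cosine(a, b):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def mutual_pair_merge(tokens):
    """Average mutual nearest-neighbor pairs; return (merged_tokens, merge_map).

    merge_map[i] is the index in merged_tokens that original token i maps to.
    """
    n = len(tokens)
    nearest = []
    for i in range(n):
        best_j, best_sim = -1, -math.inf
        for j in range(n):
            if j != i:
                sim = _cosine(tokens[i], tokens[j])
                if sim > best_sim:
                    best_j, best_sim = j, sim
        nearest.append(best_j)

    merged, merge_map = [], [-1] * n
    for i in range(n):
        if merge_map[i] != -1:
            continue  # already merged as the partner of an earlier token
        j = nearest[i]
        if j != -1 and nearest[j] == i and merge_map[j] == -1:
            # i and j pick each other: replace both with their average.
            merged.append([(x + y) / 2 for x, y in zip(tokens[i], tokens[j])])
            merge_map[i] = merge_map[j] = len(merged) - 1
        else:
            merged.append(list(tokens[i]))
            merge_map[i] = len(merged) - 1
    return merged, merge_map


def gather_reconstruct(merged, merge_map):
    """Restore the original token count by gathering from the merged tokens."""
    return [merged[k] for k in merge_map]
```

Since only mutual pairs are merged, no keep-rate or threshold is needed. The reconstruction is a pure index gather, so a segmentation head that expects the full token grid can run unchanged on its output.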

Abstract:
Recent lightweight semantic segmentation methods have made significant progress by combining compact backbones with efficient decoder heads. However, most multi‑scale decoders compute attention independently at each feature scale, introducing substantial redundancy since the resulting attention distributions across scales are strongly correlated. We propose Cross‑Stage Attention Propagation (CSAP), a decoder framework that computes attention at the deepest feature scale and propagates the resulting attention maps to shallower stages, bypassing query‑key computation at those stages entirely. This design preserves multi‑scale contextual reasoning while substantially reducing the decoder's computational cost. CSAP‑Tiny achieves 42.9% mIoU on ADE20K with only 5.5 GFLOPs, 80.5% on Cityscapes with 21.5 GFLOPs, and 40.9% on COCO‑Stuff 164K with 5.5 GFLOPs, surpassing SegNeXt‑Tiny by +1.8% on ADE20K while requiring 16.8% fewer floating‑point operations.

Abstract:
Vision‑Language Models have achieved strong progress in ground‑view visual understanding, yet they remain brittle in high‑altitude Unmanned Aerial Vehicle scenes, where objects are tiny and densely packed, textures are repetitive, and top‑down orientations are ambiguous. We introduce UAVReason, a large‑scale UAV‑native dataset and evaluation suite for studying unified aerial reasoning and generation under this nadir‑view domain shift. UAVReason aligns RGB imagery, depth maps, semantic segmentation masks, captions, and question‑answer pairs within a consistent aerial domain. It contains 23.6K captioned frames, 273K VQA pairs including 68.2K two‑frame temporal questions, and 188.8K cross‑modal generation samples across RGB, depth, and segmentation modalities. We further adapt UAVReason‑Bagel as a unified understanding‑and‑generation baseline that jointly optimizes language reasoning and dense visual generation objectives. Experiments show that general‑purpose VLMs and off‑the‑shelf unified generators struggle with UAV‑native grounding, while UAVReason‑Bagel substantially improves over its pretrained counterpart, increasing VQA‑1F F1 from 0.394 to 0.711, VQA‑2F F1 from 0.427 to 0.822, and heading‑aware VQA F1 from 0.798 to 0.973. For generation, it improves segmentation mIoU to 0.143 and reduces KID from 0.078 to 0.048 for depth‑segmentation‑text‑conditioned RGB synthesis. More importantly, our ablations reveal a bidirectional synergy between synthesis and reasoning. Dense generation objectives improve temporal semantic consistency, while language‑level reasoning regularizes sparse‑condition image synthesis. These results suggest that unified reasoning and generation provide effective geometry‑aware structural priors for physically grounded aerial intelligence. All data, code, and evaluation tools will be released.

Abstract:
Traditional human vision‑centric image compression methods are suboptimal for machine vision centric compression due to different visual properties and feature characteristics. To address this problem, we propose a Channel Importance‑driven learned Image Coding for Machines (CI‑ICM), aiming to maximize the performance of machine vision tasks at a given bitrate constraint. First, we propose a Channel Importance Generation (CIG) module to quantify channel importance in machine vision and develop a channel order loss to rank channels in descending order. Second, to properly allocate bitrate among feature channels, we propose a Feature Channel Grouping and Scaling (FCGS) module that non‑uniformly groups the feature channels based on their importance and adjusts the dynamic range of each group. Based on FCGS, we further propose a Channel Importance‑based Context (CI‑CTX) module to allocate bits among feature groups and to preserve higher fidelity in critical channels. Third, to adapt to multiple machine tasks, we propose a Task‑Specific Channel Adaptation (TSCA) module to adaptively enhance features for multiple downstream machine tasks. Experimental results on the COCO2017 dataset show that the proposed CI‑ICM achieves BD‑mAP@50:95 gains of 16.25% in object detection and 13.72% in instance segmentation over the established baseline codec. Ablation studies validate the effectiveness of each contribution, and computation complexity analysis reveals the practicability of the CI‑ICM. This work establishes feature channel optimization for machine vision‑centric compression, bridging the gap between image coding and machine perception.

Abstract:
Recent advances in vision‑language models (VLMs) trained on web‑scale image‑text pairs have enabled impressive zero‑shot transfer across a diverse range of visual tasks. However, comprehensive and independent evaluation beyond standard benchmarks is essential to understand their robustness, limitations, and real‑world applicability. This paper presents a systematic evaluation framework for VLMs under natural adversarial scenarios for diverse downstream tasks, which has been overlooked in previous evaluation works. We evaluate a wide range of VLMs (CLIP, robust CLIP, BLIP2, and SigLIP2) on curated adversarial datasets (typographic attacks, ImageNet‑A, and natural language‑induced adversarial examples). We measure the natural adversarial performance of selected VLMs for zero‑shot image classification, semantic segmentation, and visual question answering. Our analysis reveals that robust CLIP models can amplify natural adversarial vulnerabilities, and CLIP models significantly reduce performance for natural language‑induced adversarial examples. Additionally, we provide interpretable analyses to identify failure modes. We hope our findings inspire future research in robust and fair multimodal pattern recognition.

Abstract:
Learning interpretable multimodal representations inherently relies on uncovering the conditional dependencies between heterogeneous features. However, applying sparse graph estimation techniques, such as Graphical Lasso (GLasso), to visual‑linguistic domains is severely bottlenecked by high‑dimensional noise, modality misalignment, and the confounding of shared versus category‑specific topologies. In this paper, we propose Cross‑Modal Graphical Lasso (CM‑GLasso), which overcomes these fundamental limitations. By coupling a novel text‑visualization strategy with a unified vision‑language encoder, we strictly align multimodal features into a shared latent space. We introduce a cross‑attention distillation mechanism that condenses high‑dimensional patches into explicit semantic nodes, naturally extracting spatially aware cross‑modal priors. Furthermore, we unify tailored GLasso estimation and Common‑Specific Structure Learning (CSSL) into a joint objective optimized via the Alternating Direction Method of Multipliers (ADMM). This formulation guarantees the simultaneous disentanglement of invariant and class‑specific precision matrices without multi‑step error accumulation. Extensive experiments across eight benchmarks covering both natural and medical domains demonstrate that CM‑GLasso establishes a new state of the art in generative classification and dense semantic segmentation tasks.

Abstract:
Foundation models deliver strong perception but are often too computationally heavy to deploy, and adapting them typically requires costly annotations. We introduce a semi‑supervised knowledge distillation (SSKD) framework that compresses pre‑trained vision foundation models (VFMs) into compact experts using limited labeled and abundant unlabeled data, and instantiate it for instance segmentation where per‑pixel labels are particularly expensive. The framework unfolds in three stages: (1) domain adaptation of the VFM(s) via self‑training with contrastive calibration, (2) knowledge transfer through a unified multi‑objective loss, and (3) student refinement to mitigate residual pseudo‑label bias. Central to our approach is an instance‑aware pixel‑wise contrastive loss that fuses mask and class scores to extract informative negatives and enforce clear inter‑instance margins. By maintaining this contrastive signal across both adaptation and distillation, we align teacher and student embeddings and more effectively leverage unlabeled images. On Cityscapes and ADE20K, our approximately 11× smaller student improves over its zero‑shot VFM teacher(s) by +11.9 and +8.6 AP, surpasses adapted teacher(s) by +3.4 and +1.5 AP, and outperforms state‑of‑the‑art SSKD methods on benchmarks.

Abstract:
Foundation models (FM) are reshaping computer vision by reducing reliance on task‑specific supervised learning and leveraging general visual representations learned at scale. In precision livestock farming, most pipelines remain dominated by supervised learning models that require extensive labeled data, repeated retraining, and farm‑specific tuning. This study presents an FM‑centered workflow for automated monitoring of group‑housed nursery pigs, in which pretrained vision‑language FM serve as general visual backbones and farm‑specific adaptation is achieved through modular post‑processing. Grounding‑DINO was first applied to 1,418 annotated images to establish a baseline detection performance. While detection accuracy was high under daytime conditions, performance degraded under night‑vision and heavy occlusion, motivating the integration of temporal tracking logic. Building on these detections, short‑term video segmentation with Grounded‑SAM2 was evaluated on 550 one‑minute video clips; after post‑processing, over 80% of 4,927 active tracks were fully correct, with most remaining errors arising from inaccurate masks or duplicated labels. To support identity consistency over an extended time, we further developed a long‑term tracking pipeline integrating initialization, tracking, matching, mask refinement, re‑identification, and post‑hoc quality control. This system was evaluated on a continuous 132‑minute video and maintained stable identities throughout. On 132 uniformly sampled ground‑truth frames, the system achieved a mean region similarity (J) of 0.83, contour accuracy (F) of 0.92, J&F of 0.87, MOTA of 0.99, and MOTP of 90.7%, with no identity switches. Overall, this work demonstrates how FM prior knowledge can be combined with lightweight, task‑specific logic to enable scalable, label‑efficient, and long‑duration monitoring in pig production.

Abstract:
Multimodal semantic segmentation has shown great potential in leveraging complementary information across diverse sensing modalities. However, existing approaches often rely on carefully designed fusion strategies that either use modality‑specific adaptations or rely on loosely coupled interactions, thereby limiting flexibility and resulting in less effective cross‑modal coordination. Moreover, these methods often struggle to balance efficient information exchange with preserving the unique characteristics of each modality across different modality combinations. To address these challenges, we propose CrossWeaver, a simple yet effective multimodal fusion framework for arbitrary‑modality semantic segmentation. Its core is a Modality Interaction Block (MIB), which enables selective and reliability‑aware cross‑modal interaction within the encoder, while a lightweight Seam‑Aligned Fusion (SAF) module further aggregates the enhanced features. Extensive experiments on multiple multimodal semantic segmentation benchmarks demonstrate that our framework achieves state‑of‑the‑art performance with minimal additional parameters and strong generalization to unseen modality combinations.

Abstract:
Pavement condition assessment is essential for road safety and maintenance. Existing research has made significant progress. However, most studies focus on conventional computer vision tasks such as classification, detection, and segmentation. In real‑world applications, pavement inspection requires more than visual recognition. It also requires quantitative analysis, explanation, and interactive decision support. Current datasets are limited. They focus on unimodal perception. They lack support for multi‑turn interaction and fact‑grounded reasoning. They also do not connect perception with vision‑language analysis. To address these limitations, we introduce PaveBench, a large‑scale benchmark for pavement distress perception and interactive vision‑language analysis on real‑world highway inspection images. PaveBench supports four core tasks: classification, object detection, semantic segmentation, and vision‑language question answering. It provides unified task definitions and evaluation protocols. On the visual side, PaveBench provides large‑scale annotations and includes a curated hard‑distractor subset for robustness evaluation. It contains a large collection of real‑world pavement images. On the multimodal side, we introduce PaveVQA, a real‑image question answering (QA) dataset that supports single‑turn, multi‑turn, and expert‑corrected interactions. It covers recognition, localization, quantitative estimation, and maintenance reasoning. We evaluate several state‑of‑the‑art methods and provide a detailed analysis. We also present a simple and effective agent‑augmented visual question answering framework that integrates domain‑specific models as tools alongside vision‑language models. The dataset is available at: https://huggingface.co/datasets/MML‑Group/PaveBench.

Abstract:
The deep integration of communication with intelligence and sensing, as a defining vision of 6G, renders environment‑aware channel prediction a key enabling technology. As a representative 6G application, vehicular communications require accurate and forward‑looking channel prediction under stringent reliability, latency, and adaptability demands. Traditional empirical and deterministic models remain limited in balancing accuracy, generalization, and deployability, while the growing availability of onboard and roadside sensing devices offers a promising source of environmental priors. This paper proposes an environment‑aware channel prediction framework based on multimodal visual feature fusion. Using GPS data and vehicle‑side panoramic RGB images, together with semantic segmentation and depth estimation, the framework extracts semantic, depth, and position features through a three‑branch architecture and performs adaptive multimodal fusion via a squeeze‑excitation attention gating module. For 360‑dimensional angular power spectrum (APS) prediction, a dedicated regression head and a composite multi‑constraint loss are further designed. As a result, joint prediction of path loss (PL), delay spread (DS), azimuth spread of arrival (ASA), azimuth spread of departure (ASD), and APS is achieved. Experiments on a synchronized urban V2I measurement dataset yield the best root mean square error (RMSE) of 3.26 dB for PL, RMSEs of 37.66 ns, 5.05 degrees, and 5.08 degrees for DS, ASA, and ASD, respectively, and mean/median APS cosine similarities of 0.9342/0.9571, demonstrating strong accuracy, generalization, and practical potential for intelligent channel prediction in 6G vehicular communications.
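
The two evaluation quantities quoted above, RMSE and cosine similarity between angular power spectra, can be sketched as follows. This is only an illustration of the standard definitions; the toy 360-bin spectrum and the path-loss values are made up and are not from the paper's dataset.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Return 0.0 for a zero vector to avoid division by zero.
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def rmse(pred, target):
    """Root mean square error between predictions and targets."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))


# Hypothetical 360-bin angular power spectrum, one bin per degree.
measured = [1.0 + math.cos(math.radians(k)) for k in range(360)]
# A prediction with the right shape but the wrong scale.
predicted = [2.0 * x for x in measured]
sim = cosine_similarity(predicted, measured)

# Hypothetical path-loss predictions and measurements in dB.
err = rmse([100.0, 104.0], [103.0, 100.0])
```

Cosine similarity is insensitive to overall scale, so the rescaled prediction still scores 1 up to floating-point error. It therefore measures agreement in spectral shape, while RMSE penalizes absolute deviation.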

Abstract:
This paper presents a lightweight, end‑to‑end highway lane detection architecture that jointly captures spatial and temporal information for robust performance in real‑world driving scenarios. Building on the strengths of 3D convolutional neural networks and instance segmentation, we propose two models that integrate a 3D‑ResNet encoder with a Point Instance Network (PINet) decoder. The first model enhances multi‑scale feature representation using a Feature Pyramid Network (FPN) and Self‑Attention mechanism to refine spatial dependencies. The second model introduces a Region of Interest (ROI) detection head to selectively focus on lane‑relevant regions, thereby improving precision and reducing computational complexity. Experiments conducted on the TuSimple dataset (highway driving scenarios) demonstrate that the proposed second model achieves 93.40% accuracy while significantly reducing false negatives. Compared to existing 2D and 3D baselines, our approach achieves improved performance with fewer parameters and reduced latency. The architecture has been validated through offline training and real‑time inference in the Autonomous Systems Laboratory at City, St George's University of London. These results suggest that the proposed models are well‑suited for integration into Advanced Driver Assistance Systems (ADAS), with potential scalability toward full Lane Assist Systems (LAS).

Abstract:
Open‑vocabulary semantic segmentation in the remote sensing (RS) field requires both language‑aligned recognition and fine‑grained spatial delineation. Although CLIP offers robust semantic generalization, its global‑aligned visual representations inherently struggle to capture structural details. Recent methods attempt to compensate for this by introducing RS‑pretrained DINO features. However, these methods treat CLIP representations as a monolithic semantic space and cannot localize where structural enhancement is required, failing to effectively delineate boundaries while risking the disruption of CLIP's semantic integrity. To address this limitation, we propose DR‑Seg, a novel decouple‑and‑rectify framework in this paper. Our method is motivated by the key observation that CLIP feature channels exhibit distinct functional heterogeneity rather than forming a uniform semantic space. Building on this insight, DR‑Seg decouples CLIP features into semantics‑dominated and structure‑dominated subspaces, enabling targeted structural enhancement by DINO without distorting language‑aligned semantics. Subsequently, a prior‑driven graph rectification module injects high‑fidelity structural priors under DINO guidance to form a refined branch, while an uncertainty‑guided adaptive fusion module dynamically integrates this refined branch with the original CLIP branch for final prediction. Comprehensive experiments across eight benchmarks demonstrate that DR‑Seg establishes a new state‑of‑the‑art.

Abstract:
Annotated 3D scene data is scarce and expensive to acquire, while abundant unlabeled videos are readily available on the internet. In this paper, we demonstrate that carefully designed data engines can leverage web‑curated, unlabeled videos to automatically generate training data, to facilitate end‑to‑end models in 3D scene understanding alongside human‑annotated datasets. We identify and analyze bottlenecks in automated data generation, revealing critical factors that determine the efficiency and effectiveness of learning from unlabeled data. To validate our approach across different perception granularities, we evaluate on three tasks spanning from low‑level perception, i.e., 3D object detection and instance segmentation, to high‑level reasoning, i.e., 3D spatial Visual Question Answering (VQA) and Vision‑Language Navigation (VLN). Models trained on our generated data demonstrate strong zero‑shot performance and show further improvement after finetuning. This demonstrates the viability of leveraging readily available web data as a path toward more capable scene understanding systems.

Abstract:
Textured 3D meshes jointly represent geometry, topology, and appearance, yet their irregular structure poses significant challenges for deep‑learning‑based semantic segmentation. While a few recent methods operate directly on meshes without imposing geometric constraints, they typically overlook the rich textural information also provided by such meshes. We introduce a texture‑aware transformer that learns directly from raw pixels associated with each mesh face, coupled with a new hierarchical learning scheme for multi‑scale feature aggregation. A texture branch summarizes all face‑level pixels into a learnable token, which is fused with geometrical descriptors and processed by a stack of Two‑Stage Transformer Blocks (TSTB), which allow for both a local and a global information flow. We evaluate our model on the Semantic Urban Meshes (SUM) benchmark and a newly curated cultural‑heritage dataset comprising textured roof tiles with triangle‑level annotations for damage types. Our method achieves 81.9% mF1 and 94.3% OA on SUM and 49.7% mF1 and 72.8% OA on the new dataset, substantially outperforming existing approaches.

Abstract:
Crowd instance segmentation is a crucial task with a wide range of applications, including surveillance and transportation. Currently, point labels are common in crowd datasets, while region labels (e.g., boxes) are rare and inaccurate. The masks obtained through segmentation help to improve the accuracy of region labels and resolve the correspondence between individual location coordinates and crowd density maps. However, directly applying currently popular large foundation models such as SAM does not yield ideal results in dense crowds. To this end, we first propose Dense Point‑to‑Mask Optimization (DPMO), which integrates SAM with the Nearest Neighbor Exclusive Circle (NNEC) constraint to generate dense instance segmentation from point annotations. With DPMO and manual correction, we obtain mask annotations from the existing point annotations for traditional crowd datasets. Then, to predict instance segmentation in dense crowds, we propose a Reinforced Point Selection (RPS) framework trained with Group Relative Policy Optimization (GRPO), which selects the best predicted point from a sampling of the initial point prediction. Through extensive experiments, we achieve state‑of‑the‑art crowd instance segmentation performance on ShanghaiTech, UCF‑QNRF, JHU‑CROWD++, and NWPU‑Crowd datasets. Furthermore, we design new loss functions supervised by masks that boost counting performance across different models, demonstrating the significant role of mask annotations in enhancing counting accuracy.

Abstract:
Open‑set test‑time adaptation (OSTTA) addresses the challenge of adapting models to new environments where out‑of‑distribution (OOD) samples coexist with in‑distribution (ID) samples affected by distribution shifts. In such settings, covariate shift (for example, changes in weather conditions such as snow) can alter ID samples, reducing model reliability. Consequently, models must not only correctly classify covariate‑shifted ID (csID) samples but also effectively reject covariate‑shifted OOD (csOOD) samples. Entropy minimization is a common strategy in test‑time adaptation to maintain ID performance under distribution shifts, while entropy maximization is widely applied to enhance OOD detection. Several studies have sought to combine these objectives to tackle the challenges of OSTTA. However, the intrinsic conflict between entropy minimization and maximization inevitably leads to a trade‑off between csID classification and csOOD detection. In this paper, we first analyze the limitations of entropy maximization in OSTTA and then introduce an angular loss to regulate feature norm magnitudes, along with a feature‑norm loss to suppress csOOD logits, thereby improving OOD detection. These objectives form ROSETTA, a robust open‑set test‑time adaptation method. Our method achieves strong OOD detection while maintaining high ID classification performance on CIFAR‑10‑C, CIFAR‑100‑C, Tiny‑ImageNet‑C and ImageNet‑C. Furthermore, experiments on Cityscapes validate the method's effectiveness in real‑world semantic segmentation, and results on the HAC dataset demonstrate its applicability across different open‑set TTA setups.
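
The conflict between entropy minimization and entropy maximization mentioned above can be made concrete with a small sketch of the quantity both objectives act on: the Shannon entropy of a softmax prediction. This illustrates the general idea only and is not the angular or feature‑norm loss proposed in the paper; the logit values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical logits for a confident prediction and a maximally uncertain one.
confident = entropy(softmax([8.0, 0.0, 0.0]))
uncertain = entropy(softmax([0.0, 0.0, 0.0]))
```

Minimizing entropy pushes a prediction toward the confident case, whereas maximizing it pushes toward the uniform case, whose entropy equals the logarithm of the number of classes. Applying both to the same batch therefore pulls predictions in opposite directions whenever ID and OOD samples are hard to separate.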

Abstract:
Vision Transformers (ViTs) have demonstrated state‑of‑the‑art performance in several benchmarks, yet their high computational costs hinder their practical deployment. Patch Pruning offers significant savings, but existing approaches restrict token reduction to deeper layers, leaving early‑stage compression unexplored. This limits their potential for holistic efficiency. In this work, we present a novel Video Patch Pruning framework (VPP) that integrates temporal prior knowledge to enable efficient sparsity within early ViT layers. Our approach is motivated by the observation that prior features extracted from deeper layers exhibit strong foreground selectivity. Therefore, we propose a fully differentiable module for temporal mapping to accurately select the most relevant patches in early network stages. Notably, the proposed method enables a patch reduction of up to 60% in dense prediction tasks, exceeding the capabilities of conventional image‑based patch pruning, which typically operates around a 30% patch sparsity. VPP excels in the high‑sparsity regime, sustaining remarkable performance even when patch usage is reduced below 55%. Specifically, it preserves stable results with a maximal performance drop of 0.6% on the Youtube‑VIS 2021 dataset.

Abstract:
In the Complex Video Object Segmentation task, researchers are required to track and segment specific targets within cluttered environments, which rigorously tests a method's capability for target comprehension and environmental adaptability. Although SAM3, the current state‑of‑the‑art solution, exhibits unparalleled segmentation performance and robustness on conventional targets, it underperforms on tiny and semantic‑dominated objects. The root cause of this limitation lies in SAM3's insufficient comprehension of these specific target types. To address this issue, we propose TEP: Advancing Complex Video Object Segmentation via Tracking‑Enhanced Prompts. As a training‑free approach, TEP leverages external tracking models and Multimodal Large Language Models to introduce tracking‑enhanced prompts, thereby alleviating the difficulty SAM3 faces in understanding these challenging targets. Our method achieved first place (56.91%) on the test set of the PVUW Challenge 2026: Complex Video Object Segmentation Track.

Abstract:
Unsupervised segmentation approaches have increasingly leveraged foundation models (FM) to improve salient object discovery. However, these methods often falter in scenes with complex, multi‑component morphologies, where fine‑grained structural detail is indispensable. Many state‑of‑the‑art unsupervised segmentation pipelines rely on mask discovery approaches that utilize coarse, patch‑level representations. These coarse representations inherently suppress the fine‑grained detail required to resolve such complex morphologies. To overcome this limitation, we propose Excite, Attend and Segment (EASe), an unsupervised domain‑agnostic semantic segmentation framework for easy fine‑grained mask discovery across challenging real‑world scenes. EASe utilizes novel Semantic‑Aware Upsampling with Channel Excitation (SAUCE) to excite low‑resolution FM feature channels for selective calibration and attends across spatially‑encoded image and FM features to recover full‑resolution semantic representations. Finally, EASe segments the aggregated features into multi‑granularity masks using a novel training‑free Cue‑Attentive Feature Aggregator (CAFE) which leverages SAUCE attention scores as a semantic grouping signal. EASe, together with SAUCE and CAFE, operates directly on pixel‑level feature representations to enable accurate fine‑grained dense semantic mask discovery. Our evaluation demonstrates superior performance of EASe over previous state‑of‑the‑art (SOTA) methods across major standard benchmarks and diverse datasets with complex morphologies. Code is available at https://ease‑project.github.io

Abstract:
Marine scene understanding and segmentation play a vital role in maritime monitoring and navigation safety. However, prevalent factors like fog and strong reflections in maritime environments cause severe image degradation, significantly compromising the stability of semantic perception. Existing restoration and enhancement methods typically target specific degradations or focus solely on visual quality, lacking end‑to‑end collaborative mechanisms that simultaneously improve structural recovery and semantic effectiveness. Moreover, publicly available infrared‑visible datasets are predominantly collected from urban scenes, failing to capture the authentic characteristics of coupled degradations in marine environments. To address these challenges, the Infrared‑Visible Maritime Ship Dataset (IVMSD) is proposed to cover various maritime scenarios under diverse weather and illumination conditions. Building upon this dataset, a Multi‑task Complementary Learning Framework (MCLF) is proposed to collaboratively perform image restoration, multimodal fusion, and semantic segmentation within a unified architecture. The framework includes a Frequency‑Spatial Enhancement Complementary (FSEC) module for degradation suppression and structural enhancement, a Semantic‑Visual Consistency Attention (SVCA) module for semantic‑consistent guidance, and a cross‑modality guided attention mechanism for selective fusion. Experimental results on IVMSD demonstrate that the proposed method achieves state‑of‑the‑art segmentation performance, significantly enhancing robustness and perceptual quality under complex maritime conditions.

Abstract:
The scarcity and high cost of expert annotations in dental imaging present a significant challenge for the development of AI in dentistry. DINOv3, a state‑of‑the‑art, self‑supervised vision foundation model pre‑trained on 1.7 billion images, offers a promising pathway to mitigate this issue. However, its reliability when transferred to the dental domain, with its unique imaging characteristics and clinical subtleties, remains unclear. To address this, we introduce DinoDental, a unified benchmark designed to systematically evaluate whether DINOv3 can serve as a reliable, off‑the‑shelf encoder for comprehensive dental image analysis without requiring domain‑specific pre‑training. Constructed from multiple public datasets, DinoDental covers a wide range of tasks, including classification, detection, and instance segmentation on both panoramic radiographs and intraoral photographs. We further analyze the model's transfer performance by scaling its size and input resolution, and by comparing different adaptation strategies, including frozen features, full fine‑tuning, and the parameter‑efficient Low‑Rank Adaptation (LoRA) method. Our experiments show that DINOv3 can serve as a strong unified encoder for dental image analysis across both panoramic radiographs and intraoral photographs, remaining competitive across tasks while showing particularly clear advantages for intraoral image understanding and boundary‑sensitive dense prediction. Collectively, DinoDental provides a systematic framework for comprehensively evaluating DINOv3 in dental analysis, establishing a foundational benchmark to guide efficient and effective model selection and adaptation for the dental AI community.

Abstract:
Domain Generalized Semantic Segmentation (DGSS) aims to maintain robust performance across unseen target domains. Vision Foundation Models (VFMs) offer rich multi‑domain knowledge that can enhance generalization. However, strategies for actively exploiting the rich subspace structures within VFMs remain under‑explored, with many existing methods focusing primarily on preserving pre‑trained knowledge. Furthermore, their LoRA components often suffer from limited representational diversity and inefficient parameter utilization. We propose RecycleLoRA, which addresses both challenges by employing Rank‑Revealing QR Decomposition (RRQR) to systematically exploit VFM's subspace structures and enhance LoRA's representational richness. Our main adapter leverages minor subspace directions identified by RRQR to learn diverse and independent features, achieving competitive performance even when used alone. We further introduce a sub adapter that carefully refines major directions with minimal adjustments, providing complementary improvements to the main adapter's strong baseline performance. This design enables the dual adapters to learn distinct representations without requiring additional regularization losses. Our systematic exploitation of pre‑trained subspace structures through RRQR‑based initialization leads to superior domain generalization performance. RecycleLoRA achieves state‑of‑the‑art performance on both synthetic‑to‑real generalization and real‑to‑real generalization tasks without complex architectures or additional inference latency.

Abstract:
Parliamentary proceedings represent a rich yet challenging resource for computational analysis, particularly when preserved only as scanned historical documents. Existing efforts to transcribe Italian parliamentary speeches have relied on traditional Optical Character Recognition pipelines, resulting in transcription errors and limited semantic annotation. In this paper, we propose a pipeline based on Vision‑Language Models for the automatic transcription, semantic segmentation, and entity linking of Italian parliamentary speeches. The pipeline employs a specialised OCR model to extract text while preserving reading order, followed by a large‑scale Vision‑Language Model that performs transcription refinement, element classification, and speaker identification by jointly reasoning over visual layout and textual content. Extracted speakers are then linked to the Chamber of Deputies knowledge base through SPARQL queries and a multi‑strategy fuzzy matching procedure. Evaluation against an established benchmark demonstrates substantial improvements both in transcription quality and speaker tagging.

Abstract:
Nuclei instance segmentation is critical in computational pathology for cancer diagnosis and prognosis. Recently, the Segment Anything Model has demonstrated exceptional performance in various segmentation tasks, leveraging its rich priors and powerful global context modeling capabilities derived from large‑scale pre‑training on natural images. However, directly applying SAM to the medical imaging domain faces significant limitations: it lacks sufficient perception of the local structural features that are crucial for nuclei segmentation, and full fine‑tuning for downstream tasks requires substantial computational costs. To efficiently transfer SAM's robust prior knowledge to nuclei instance segmentation while supplementing its task‑aware local perception, we propose a parameter‑efficient fine‑tuning framework, named Cooperative Fine‑Grained Refinement of SAM, consisting of three core components: 1) a Multi‑scale Adaptive Local‑aware Adapter, which enables effective capability transfer by augmenting the frozen SAM backbone with minimal parameters and instilling a powerful perception of local structures through dynamically generated, multi‑scale convolutional kernels; 2) a Hierarchical Modulated Fusion Module, which dynamically aggregates multi‑level encoder features to preserve fine‑grained spatial details; and 3) a Boundary‑Guided Mask Refinement, which integrates multi‑context boundary cues with semantic features through explicit supervision, producing a boundary‑focused signal to refine initial mask predictions for sharper delineation. These three components work cooperatively to enhance local perception, preserve spatial details, and refine boundaries, enabling SAM to perform accurate nuclei instance segmentation directly.

Abstract:
Semantic segmentation across arbitrary sensor modalities faces significant challenges due to diverse sensor characteristics, and the traditional configurations for this task result in redundant development efforts. We address these challenges by introducing a universal arbitrary‑modal semantic segmentation framework that unifies segmentation across multiple modalities. Our approach features three key innovations: (1) the Modality‑aware CLIP (MA‑CLIP), which provides modality‑specific scene understanding guidance through LoRA fine‑tuning; (2) Modality‑aligned Embeddings for capturing fine‑grained features; and (3) the Domain‑specific Refinement Module (DSRM) for dynamic feature adjustment. Evaluated on five diverse datasets with different complementary modalities (event, thermal, depth, polarization, and light field), our model surpasses specialized multi‑modal methods and achieves state‑of‑the‑art performance with an mIoU of 65.03%. The code will be released upon acceptance.

Abstract:
Referring image segmentation aims to localize and segment a target object in an image based on a free‑form referring expression. The core challenge lies in effectively bridging linguistic descriptions with object‑level visual representations, especially when referring expressions involve detailed attributes and complex inter‑object relationships. Existing methods either rely on cross‑modal alignment or employ Semantic Segmentation Prompts, but they often lack explicit reasoning mechanisms for grounding language descriptions to target regions in the image. To address these limitations, we propose PPCR, a Progressive Prompt‑guided Cross‑modal Reasoning framework for referring image segmentation. PPCR explicitly structures the reasoning process as a Semantic Understanding‑Spatial Grounding‑Instance Segmentation pipeline. Specifically, PPCR first employs multimodal large language models (MLLMs) to generate Semantic Segmentation Prompts that capture key semantic cues of the target object. Based on this semantic context, Spatial Segmentation Prompts are further generated to reason about object location and spatial extent, enabling a progressive transition from semantic understanding to spatial grounding. The Semantic and Spatial Segmentation Prompts are then jointly integrated into the segmentation module to guide accurate target localization and segmentation. Extensive experiments on standard referring image segmentation benchmarks demonstrate that PPCR consistently outperforms existing methods. The code will be publicly released to facilitate reproducibility.

Abstract:
Off‑road semantic segmentation is fundamentally challenged by irregular terrain, vegetation clutter, and inherent annotation ambiguity. Unlike urban scenes with crisp object boundaries, off‑road environments exhibit strong class‑level similarity among terrain categories, resulting in thick and uncertain transition regions that degrade boundary coherence and destabilize training. Rare or thin structures, such as narrow traversable gaps or isolated obstacles, further receive sparse and unreliable supervision and are easily overwhelmed by dominant background textures. Existing decoder designs either rely on low‑scale bottlenecks that oversmooth fine structural details, or repeatedly fuse high‑detail features, which tends to amplify annotation noise and incur substantial computational cost. We present a cross‑scale decoder that explicitly addresses these challenges through three complementary mechanisms. First, a global‑local token refinement module consolidates semantic context on a compact bottleneck lattice, guided by boundary‑aware regularization to remain robust under ambiguous supervision. Second, a gated detail bridge selectively injects fine‑scale structural cues only once through cross‑scale attention, preserving boundary and texture information while avoiding noise accumulation. Third, an uncertainty‑guided class‑aware point refinement selectively updates the least reliable pixels, improving rare and ambiguous structures with minimal computational overhead. The resulting framework achieves noise‑robust and boundary‑preserving segmentation tailored to off‑road environments, recovering fine structural details while maintaining deployment‑friendly efficiency. Experimental results on standard off‑road benchmarks demonstrate consistent improvements over prior approaches without resorting to heavy dense feature fusion.

Abstract:
Street‑level imagery contains personally identifiable information (PII), some of which is context‑dependent. Existing anonymization methods either over‑process images or miss subtle identifiers, while API‑based solutions compromise data sovereignty. We present CAIAMAR (Context‑Aware Image Anonymization with Multi‑Agent Reasoning), an agentic framework for context‑aware PII segmentation with diffusion‑based anonymization, combining pre‑defined processing for high‑confidence cases with multi‑agent reasoning for indirect identifiers. Three specialized agents coordinate via round‑robin speaker selection in a Plan‑Do‑Check‑Act (PDCA) cycle, enabling large vision‑language models to classify PII based on spatial context (private vs. public property) rather than rigid category rules. The agents implement spatially‑filtered coarse‑to‑fine detection where a scout‑and‑zoom strategy identifies candidates, open‑vocabulary segmentation processes localized crops, and IoU‑based deduplication (30% threshold) prevents redundant processing. Modal‑specific diffusion guidance with appearance decorrelation substantially reduces re‑identification (Re‑ID) risks. On CUHK03‑NP, our method reduces person Re‑ID risk by 73% (R1: 16.9% vs. 62.4% baseline). For image quality preservation on CityScapes, we achieve a KID of 0.001 and an FID of 9.1, significantly outperforming existing anonymization methods. The agentic workflow detects non‑direct PII instances across object categories, and downstream semantic segmentation is preserved. Operating entirely on‑premise with open‑source models, the framework generates human‑interpretable audit trails supporting the EU's GDPR transparency requirements while flagging failed cases for human review.
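
The IoU‑based deduplication step mentioned above can be sketched as a greedy filter. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes detections are axis‑aligned boxes `(x1, y1, x2, y2)` processed in a fixed order, and that a candidate is discarded when its IoU with an already kept box exceeds the 30% threshold. The function names and the sample boxes are hypothetical.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def deduplicate(boxes, threshold=0.3):
    """Keep a box only if it overlaps every kept box by at most `threshold` IoU."""
    kept = []
    for box in boxes:
        if all(box_iou(box, k) <= threshold for k in kept):
            kept.append(box)
    return kept


# Hypothetical candidates: the second box heavily overlaps the first.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
unique = deduplicate(boxes)
```

In this toy input the first two boxes have an IoU of 81/119, roughly 0.68, so the second is dropped and the disjoint third box is kept. A real pipeline might instead compute IoU on segmentation masks or sort candidates by confidence first; the abstract does not say which.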

Abstract:
Event‑based cameras capture visual information as asynchronous streams of per‑pixel brightness changes, generating sparse, temporally precise data. Compared to conventional frame‑based sensors, they offer significant advantages in capturing high‑speed dynamics while consuming substantially less power. Predicting future event representations from past observations is an important problem, enabling downstream tasks such as future semantic segmentation or object tracking without requiring access to future sensor measurements. While recent state‑of‑the‑art approaches achieve strong performance, they often rely on computationally heavy backbones and, in some cases, large‑scale pretraining, limiting their applicability in resource‑constrained scenarios. In this work, we introduce E‑TIDE, a lightweight, end‑to‑end trainable architecture for event‑tensor prediction that is designed to operate efficiently without large‑scale pretraining. Our approach employs the TIDE module (Temporal Interaction for Dynamic Events), motivated by efficient spatiotemporal interaction design for sparse event tensors, to capture temporal dependencies via large‑kernel mixing and activity‑aware gating while maintaining low computational complexity. Experiments on standard event‑based datasets demonstrate that our method achieves competitive performance with significantly reduced model size and training requirements, making it well‑suited for real‑time deployment under tight latency and memory budgets.

Abstract:
Present‑day deep neural networks for video semantic segmentation require a large number of fine‑grained pixel‑level annotations to achieve the best possible results. Obtaining such annotations, however, is very expensive. On the other hand, raw, unannotated video frames are practically free to obtain. Similarly, coarse annotations, which do not require precise boundaries, are also much cheaper. This paper investigates approaches to reduce the annotation cost required for video segmentation datasets by utilising such resources. We show that using state‑of‑the‑art segmentation foundation models, Segment Anything Model (SAM) and Segment Anything Model 2 (SAM 2), we can utilise both unannotated frames and coarse annotations to alleviate the effort required for manual annotation of video segmentation datasets by automating mask generation. Our investigation suggests that, if these resources are used appropriately, we can reduce the need for annotation by a third with similar performance for video semantic segmentation. More significantly, our analysis suggests that the variety of frames in the dataset is more important than the number of frames for obtaining the best performance.

Abstract:
Semantic segmentation of remote sensing imagery is fundamental to Earth observation. Achieving accurate results requires integrating not only optical images but also physical variables such as the Digital Elevation Model (DEM), Synthetic Aperture Radar (SAR) and Normalized Difference Vegetation Index (NDVI). Recent foundation models (FMs) leverage pre‑training to exploit these variables but still depend on spatially aligned data and costly retraining when involving new sensors. To overcome these limitations, we introduce a novel paradigm for integrating domain‑specific physical priors into segmentation models. We first construct a Physical‑Centric Knowledge Graph (PCKG) by prompting large language models to extract physical priors from 1,763 vocabularies, and use it to build a heterogeneous, spatial‑aligned dataset, Phy‑Sky‑SA. Building on this foundation, we develop PriorSeg, a physics‑aware residual refinement model trained with a joint visual‑physical strategy that incorporates a novel physics‑consistency loss. Experiments on heterogeneous settings demonstrate that PriorSeg improves segmentation accuracy and physical plausibility without retraining the FMs. Ablation studies verify the effectiveness of the Phy‑Sky‑SA dataset, the PCKG, and the physics‑consistency loss.

Abstract:
Referring video object segmentation (RVOS) commonly grounds targets in videos based on static textual cues. The MeViS benchmark extends this by incorporating motion‑centric expressions (referring and reasoning motion expressions) and introducing no‑target queries. Extending SaSaSa2VA, where increased input frames and [SEG] tokens already strengthen the Sa2VA backbone, we adopt a simple yet effective target existence‑aware verification mechanism, leading to Still Awesome SaSaSa2VA (SaSaSaSa2VA). Despite its simplicity, the method achieves a final score of 89.19 in the 5th PVUW Challenge (MeViS‑Text Track), securing 2nd place. Both quantitative results and ablations suggest that this existence‑aware verification strategy is sufficient to unlock strong performance on motion‑centric referring tasks.

Abstract:
Generalized few‑shot semantic segmentation (GFSS) is fundamentally limited by the coverage of novel‑class appearances under scarce annotations. While diffusion models can synthesize novel‑class images at scale, practical gains are often hindered by insufficient coverage and noisy supervision when masks are unavailable or unreliable. We propose Syn4Seg, a generation‑enhanced GFSS framework designed to expand novel‑class coverage while improving pseudo‑label quality. Syn4Seg first maximizes prompt‑space coverage by constructing an embedding‑deduplicated prompt bank for each novel class, yielding diverse yet class‑consistent synthetic images. It then performs support‑guided pseudo‑label estimation via a two‑stage refinement that i) filters low‑consistency regions to obtain high‑precision seeds and ii) relabels uncertain pixels with image‑adaptive prototypes that combine global (support) and local (image) statistics. Finally, we refine only boundary‑band and unlabeled pixels using a constrained SAM‑based update to improve contour fidelity without overwriting high‑confidence interiors. Extensive experiments on PASCAL‑5^i and COCO‑20^i demonstrate consistent improvements in both 1‑shot and 5‑shot settings, highlighting synthetic data as a scalable path for GFSS with reliable masks and precise boundaries.
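
The embedding-deduplicated prompt bank described above can be illustrated with a minimal sketch. It assumes a greedy filter with a cosine-similarity threshold; the function names, the threshold `max_sim`, and the toy 2-D vectors are hypothetical and are not taken from Syn4Seg, whose actual deduplication rule is not specified in the abstract.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def deduplicate_prompts(prompts, embeddings, max_sim=0.9):
    """Greedily keep a prompt only if it is not too similar to any kept prompt.

    max_sim is a hypothetical threshold chosen for illustration.
    """
    kept = []  # indices of retained prompts
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < max_sim for j in kept):
            kept.append(i)
    return [prompts[i] for i in kept]

prompts = ["a photo of a bus", "a picture of a bus", "a bus at night"]
embeddings = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]  # toy stand-ins for text embeddings
bank = deduplicate_prompts(prompts, embeddings)
```

In this toy case the second prompt is nearly parallel to the first and is dropped, while the third points in a different direction and is kept. In practice the embeddings would come from a text encoder.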

Abstract:
Central to the success of Transformers is the attention block, which effectively models global dependencies among input tokens associated with a dataset. However, we theoretically demonstrate that standard attention mechanisms in transformers often produce ill‑conditioned matrices with large condition numbers. This ill‑conditioning is a well‑known obstacle for gradient‑based optimizers, leading to inefficient training. To address this issue, we introduce preconditioned attention, a novel approach that incorporates a conditioning matrix into each attention head. Our theoretical analysis shows that this method significantly reduces the condition number of attention matrices, resulting in better‑conditioned matrices that improve optimization. Preconditioned attention serves as a simple drop‑in replacement for a wide variety of attention mechanisms in the literature. We validate the effectiveness of preconditioned attention across a diverse set of transformer applications, including image classification, object detection, instance segmentation, long sequence modeling and language modeling.
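
To make the notion of a condition number and the effect of a conditioning matrix concrete, here is a minimal sketch on a toy 2x2 matrix. It uses generic diagonal row scaling as the preconditioner, which is an assumption made for illustration and not the conditioning matrix proposed in the paper; the helper names are hypothetical.

```python
import math

def cond_2x2(m):
    """2-norm condition number (sigma_max / sigma_min) of an invertible 2x2 matrix."""
    (a, b), (c, d) = m
    frob_sq = a * a + b * b + c * c + d * d   # trace of M^T M
    det = abs(a * d - b * c)                  # equals sigma_max * sigma_min
    # Largest eigenvalue of M^T M, i.e. sigma_max squared.
    lam_max = (frob_sq + math.sqrt(max(frob_sq ** 2 - 4 * det ** 2, 0.0))) / 2
    return lam_max / det                      # sigma_max^2 / (sigma_max * sigma_min)

def row_scale(m):
    """Left-multiply by diag(1 / row norm): a simple diagonal preconditioner."""
    return [[x / math.hypot(*row) for x in row] for row in m]

toy = [[1.0, 1.0], [0.0, 0.01]]   # rows with very different scales
before = cond_2x2(toy)            # large: the matrix is badly conditioned
after = cond_2x2(row_scale(toy))  # much smaller after rescaling the rows
```

The point of the sketch is only that multiplying by a well-chosen matrix can shrink the ratio between the largest and smallest singular values, which is the quantity gradient-based optimizers are sensitive to.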

Abstract:
Large‑scale maps of field boundaries are essential for agricultural monitoring tasks. Existing deep learning approaches for satellite‑based field mapping are sensitive to illumination, spatial scale, and changes in geographic location. We conduct the first systematic evaluation of segmentation and geospatial foundation models (GFMs) for global field boundary delineation using the Fields of The World (FTW) benchmark. We evaluate 18 models under unified experimental settings, showing that a U‑Net semantic segmentation model outperforms instance‑based and GFM alternatives on a suite of performance and deployment metrics. We propose a new segmentation approach that combines a U‑Net backbone, composite loss functions, and targeted data augmentations to enhance performance and robustness under real‑world conditions. Our model achieves a 76% IoU and 47% object‑F1 on FTW, an increase of 6% and 9% over the previous baseline. Our approach provides a practical framework for reliable, scalable, and reproducible field boundary delineation across model design, training, and inference. We release all models and model‑derived field boundary datasets for five countries.

Abstract:
Screen‑shooting robust watermarking aims to imperceptibly embed extractable information into host images such that the watermark survives the complex distortion pipeline of screen display and camera recapture. However, achieving high extraction accuracy while maintaining satisfactory visual quality remains an open challenge, primarily because the screen‑shooting channel introduces severe and entangled degradations including Moiré patterns, color‑gamut shifts, perspective warping, and sensor noise. In this paper, we present an end‑to‑end deep learning framework that jointly optimizes watermark embedding and extraction for screen‑shooting robustness. Our framework incorporates three key innovations: (i) a comprehensive noise simulation layer that faithfully models realistic screen‑shooting distortions, notably including a physically‑motivated Moiré pattern generator, enabling the network to learn robust representations against the full spectrum of capture‑channel noise through adversarial training; (ii) a Just Noticeable Distortion (JND) perceptual loss function that adaptively modulates watermark embedding strength by supervising the perceptual discrepancy between the JND coefficient map and the watermark residual, thereby concentrating watermark energy in perceptually insensitive regions to maximize visual quality; and (iii) two complementary automatic localization modules (a semantic‑segmentation‑based foreground extractor for captured image rectification and a symmetric noise template mechanism for anti‑cropping region recovery) that enable fully automated watermark decoding under realistic deployment conditions. Extensive experiments demonstrate that our method achieves an average PSNR of 30.94 dB and SSIM of 0.94 on watermarked images while embedding 127‑bit payloads.
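
For readers unfamiliar with the PSNR metric quoted above, the following sketch computes it from the standard definition, 10 * log10(peak^2 / MSE). It assumes 8-bit images flattened into lists; the pixel values are invented for illustration and have no connection to the paper's data.

```python
import math

def psnr(reference, distorted, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equal-size flat pixel lists."""
    mse = sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return math.inf  # identical images have no distortion
    return 10.0 * math.log10(peak ** 2 / mse)

host = [100, 120, 140, 160]          # toy host-image pixels
watermarked = [105, 115, 145, 155]   # same pixels after a +/-5 embedding residual
quality_db = psnr(host, watermarked)
```

A smaller embedding residual lowers the mean squared error and therefore raises PSNR, which is why watermark strength trades off against visual quality.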

Abstract:
Semantic segmentation consists of assigning a semantic label to each pixel according to predefined classes. This process facilitates the understanding of object appearance and spatial relationships, playing an important role in the global interpretation of image content. Although modern deep learning approaches achieve high accuracy, they often ignore ordinal relationships among classes, which may encode important domain knowledge for scene interpretation. In this work, loss functions that incorporate ordinal relationships into deep neural networks are investigated to promote greater semantic consistency in semantic segmentation tasks. These loss functions are categorized as unimodal, quasi‑unimodal, and spatial. Unimodal losses constrain the predicted probability distribution according to the class ordering, while quasi‑unimodal losses relax this constraint by allowing small variations while preserving ordinal coherence. Spatial losses penalize semantic inconsistencies between neighboring pixels, encouraging smoother transitions in the image space. In particular, this study adapts loss functions originally proposed for ordinal classification to ordinal semantic segmentation. Among them, the Expanded Mean Squared Error (EXP_MSE), the Quasi‑Unimodal Loss (QUL), and the spatial Contact Surface Loss using Signed Distance Function (CSSDF) are investigated. These approaches have shown promising results in medical imaging, improving robustness, generalization, and anatomical consistency.
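
The unimodality constraint mentioned above can be made concrete with a small sketch. It checks whether a per-pixel class distribution has a single peak along the class ordering, and computes a simple ordinal penalty based on the expected class index. Both functions are generic illustrations with hypothetical names; they are not the EXP_MSE or QUL losses studied in the paper.

```python
def is_unimodal(probs, tol=1e-12):
    """True if probs rises (weakly) to a single peak and then falls (weakly)."""
    peak = max(range(len(probs)), key=probs.__getitem__)
    rising = all(probs[i] <= probs[i + 1] + tol for i in range(peak))
    falling = all(probs[i] >= probs[i + 1] - tol
                  for i in range(peak, len(probs) - 1))
    return rising and falling

def expected_rank_squared_error(probs, target):
    """Squared gap between the expected class index and the true class index."""
    expected = sum(k * p for k, p in enumerate(probs))
    return (expected - target) ** 2

ordered_ok = [0.1, 0.6, 0.2, 0.1]    # mass concentrated around class 1
ordered_bad = [0.4, 0.1, 0.4, 0.1]   # two separated peaks: ordinally incoherent
```

Here `is_unimodal(ordered_ok)` holds while `is_unimodal(ordered_bad)` does not, even though both are valid probability distributions. A penalty that grows with distance in class index, unlike plain cross-entropy, treats predicting a neighboring class as a smaller error than predicting a distant one.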

Abstract:
Equivariance is a fundamental property in computer vision models, yet strict equivariance is rarely satisfied in real‑world data, which can limit a model's performance. Controlling the degree of equivariance is therefore desirable. We propose a general framework for constructing soft equivariant models by projecting the model weights into a designed subspace. The method applies to any pre‑trained architecture and provides theoretical bounds on the induced equivariance error. Empirically, we demonstrate the effectiveness of our method on multiple pre‑trained backbones, including ViT and ResNet, across image classification, semantic segmentation, and human‑trajectory prediction tasks. Notably, our approach improves the performance while simultaneously reducing equivariance error on the competitive ImageNet benchmark.

Abstract:
High‑fidelity 3D reconstruction of vehicle exteriors improves buyer confidence in online automotive marketplaces, but generating these models in cluttered dealership drive‑throughs presents severe technical challenges. Unlike static‑scene photogrammetry, this setting features a dynamic vehicle moving against heavily cluttered, static backgrounds. This problem is further compounded by wide‑angle lens distortion, specular automotive paint, and non‑rigid wheel rotations that violate classical epipolar constraints. We propose an end‑to‑end pipeline utilizing a two‑pillar camera rig. First, we resolve dynamic‑scene ambiguities by coupling SAM 3 for instance segmentation with motion‑gating to cleanly isolate the moving vehicle, explicitly masking out non‑rigid wheels to enforce strict epipolar geometry. Second, we extract robust correspondences directly on raw, distorted 4K imagery using the RoMa v2 learned matcher guided by semantic confidence masks. Third, these matches are integrated into a rig‑aware SfM optimization that utilizes CAD‑derived relative pose priors to eliminate scale drift. Finally, we use a distortion‑aware 3D Gaussian Splatting framework (3DGUT) coupled with a stochastic Markov Chain Monte Carlo (MCMC) densification strategy to render reflective surfaces. Evaluations on 25 real‑world vehicles across 10 dealerships demonstrate that our full pipeline achieves a PSNR of 28.66 dB, an SSIM of 0.89, and an LPIPS of 0.21 on held‑out views, representing a 3.85 dB improvement over standard 3D‑GS, delivering inspection‑grade interactive 3D models without controlled studio infrastructure.

Abstract:
Incremental open‑vocabulary 3D instance‑semantic mapping is essential for autonomous agents operating in complex everyday environments. However, it remains challenging due to the need for robust instance segmentation, real‑time processing, and flexible open‑set reasoning. Existing methods often rely on the closed‑set assumption or dense per‑pixel language fusion, which limits scalability and temporal consistency. We introduce OVI‑MAP that decouples instance reconstruction from semantic inference. We propose to build a class‑agnostic 3D instance map that is incrementally constructed from RGB‑D input, while semantic features are extracted only from a small set of automatically selected views using vision‑language models. This design enables stable instance tracking and zero‑shot semantic labeling throughout online exploration. Our system operates in real time and outperforms state‑of‑the‑art open‑vocabulary mapping baselines on standard benchmarks.

Abstract:
Hyperspectral sensing provides rich spectral information for scene understanding in urban driving, but its high dimensionality poses challenges for interpretation and efficient learning. We introduce Learnable Quantum Efficiency (LQE), a physics‑inspired, interpretable dimensionality reduction (DR) method that parameterizes smooth high‑order spectral response functions that emulate plausible sensor quantum efficiency curves. Unlike conventional methods or unconstrained learnable layers, LQE enforces physically motivated constraints, including a single dominant peak, smooth responses, and bounded bandwidth. This formulation yields a compact spectral representation that preserves discriminative information while remaining fully differentiable and end‑to‑end trainable within semantic segmentation models (SSMs). We conduct systematic evaluations across three publicly available multi‑class hyperspectral urban driving datasets, comparing LQE against six conventional and seven learnable baseline DR methods across six SSMs. Averaged across all SSMs and configurations, LQE achieves the highest average mIoU, improving over conventional methods by 2.45%, 0.45%, and 1.04%, and over learnable methods by 1.18%, 1.56%, and 0.81% on HyKo, HSI‑Drive, and Hyperspectral City, respectively. LQE maintains strong parameter efficiency (12 to 36 parameters compared to 51 to 22K for competing learnable approaches) and competitive inference latency. Ablation studies show that low‑order configurations are optimal, while the learned spectral filters converge to dataset‑intrinsic wavelength patterns. These results demonstrate that physics‑informed spectral learning can improve both performance and interpretability, providing a principled bridge between hyperspectral perception and data‑driven multispectral sensor design for automotive vision systems.

Abstract:
Task‑oriented grasping (TOG) is more challenging than simple object grasping because it requires precise identification of object parts and careful selection of grasping areas to ensure effective and robust manipulation. While recent approaches have trained large‑scale vision‑language models to integrate part‑level object segmentation with task‑aware grasp planning, their instability in part recognition and grasp inference limits their ability to generalize across diverse objects and tasks. To address this issue, we introduce a novel, geometry‑centric strategy for more generalizable TOG that does not rely on semantic features from visual recognition, effectively overcoming the viewpoint sensitivity of model‑based approaches. Our main proposals include: 1) an object‑part‑task ontology for functional part selection based on intuitive human commands, constructed using a Large Language Model (LLM); 2) a sampling‑based geometric analysis method for identifying the selected object part from observed point clouds, incorporating multiple point distribution and distance metrics; and 3) a similarity matching framework for imitative grasp planning, utilizing similar known objects with pre‑existing segmentation and grasping knowledge as references to guide the planning for unknown targets. We validate the high accuracy of our approach in functional part selection, identification, and grasp generation through real‑world experiments. Additionally, we demonstrate the method's generalization capabilities to novel‑category objects by extending existing ontological knowledge, showcasing its adaptability to a broad range of objects and tasks.

Abstract:
Open‑vocabulary 3D semantic segmentation aims to segment arbitrary categories beyond the training set. Existing methods predominantly rely on distilling knowledge from 2D open‑vocabulary models. However, aligning 3D features to the 2D representation space restricts intrinsic 3D geometric learning and inherits errors from 2D predictions. To address these limitations, we propose GeoGuide, a novel framework that leverages pretrained 3D models to integrate hierarchical geometry‑semantic consistency for open‑vocabulary 3D segmentation. Specifically, we introduce an Uncertainty‑based Superpoint Distillation module to fuse geometric and semantic features for estimating per‑point uncertainty, adaptively weighting 2D features within superpoints to suppress noise while preserving discriminative information to enhance local semantic consistency. Furthermore, our Instance‑level Mask Reconstruction module leverages geometric priors to enforce semantic consistency within instances by reconstructing complete instance masks. Additionally, our Inter‑Instance Relation Consistency module aligns geometric and semantic similarity matrices to calibrate cross‑instance consistency for same‑category objects, mitigating viewpoint‑induced semantic drift. Extensive experiments on ScanNet v2, Matterport3D, and nuScenes demonstrate the superior performance of GeoGuide.

Abstract:
While recent feed‑forward 3D reconstruction models provide a strong geometric foundation for scene understanding, extending them to 3D instance segmentation typically relies on a disjointed "lift‑and‑cluster" paradigm. Grouping dense pixel‑wise embeddings via non‑differentiable clustering scales poorly with the number of views and disconnects representation learning from the final segmentation objective. In this paper, we present a Feed‑forward Anchored Scene Transformer for 3D Instance Segmentation (FAST3DIS), an end‑to‑end approach that effectively bypasses post‑hoc clustering. We introduce a 3D‑anchored, query‑based Transformer architecture built upon a foundational depth backbone, adapted efficiently to learn instance‑specific semantics while retaining its zero‑shot geometric priors. We formulate a learned 3D anchor generator coupled with an anchor‑sampling cross‑attention mechanism for view‑consistent 3D instance segmentation. By projecting 3D object queries directly into multi‑view feature maps, our method samples context efficiently. Furthermore, we introduce a dual‑level regularization strategy that couples multi‑view contrastive learning with a dynamically scheduled spatial overlap penalty to explicitly prevent query collisions and ensure precise instance boundaries. Experiments on complex indoor 3D datasets demonstrate that our approach achieves competitive segmentation accuracy with significantly improved memory scalability and inference speed over state‑of‑the‑art clustering‑based methods.

Abstract:
Semantic segmentation in marine environments is crucial for the autonomous navigation of unmanned surface vessels (USVs) and for coastal Earth Observation of events such as oil spills. However, existing methods, often relying on deep CNNs and transformer‑based architectures, face challenges in deployment due to their high computational costs and resource‑intensive nature. These limitations hinder the practicality of real‑time, low‑cost applications in real‑world marine settings. To address this, we propose LEMMA, a lightweight semantic segmentation model designed specifically for accurate remote sensing segmentation under resource constraints. The proposed architecture leverages Laplacian Pyramids to enhance edge recognition, a critical component for effective feature extraction in complex marine environments for disaster response, environmental surveillance, and coastal monitoring. By integrating edge information early in the feature extraction process, LEMMA eliminates the need for computationally expensive feature map computations in deeper network layers, drastically reducing model size, complexity and inference time. LEMMA demonstrates state‑of‑the‑art performance across datasets captured from diverse platforms while reducing trainable parameters and computational requirements by up to 71x, GFLOPs by up to 88.5%, and inference time by up to 84.65%, as compared to existing models. Experimental results highlight its effectiveness and real‑world applicability, including 93.42% IoU on the Oil Spill dataset and 98.97% mIoU on MaSTr1325.
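
The way a Laplacian pyramid exposes edge information can be shown with a minimal 1-D sketch. It is a simplification under stated assumptions: a box filter (pair averaging) replaces the usual Gaussian blur, the signal is one-dimensional, and the function names are hypothetical. It does not reproduce LEMMA's architecture.

```python
def downsample(signal):
    """Halve the length by averaging neighbouring pairs (box filter)."""
    return [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]

def upsample(signal):
    """Double the length by repeating each sample."""
    return [v for v in signal for _ in range(2)]

def laplacian_pyramid(signal, levels):
    """Return [detail_0, ..., detail_{levels-1}, coarse_residual]."""
    pyramid, current = [], list(signal)
    for _ in range(levels):
        coarse = downsample(current)
        # Detail band: what the coarse approximation cannot represent (edges).
        pyramid.append([c - u for c, u in zip(current, upsample(coarse))])
        current = coarse
    pyramid.append(current)
    return pyramid

def reconstruct(pyramid):
    """Invert the pyramid by upsampling and adding the detail bands back."""
    current = pyramid[-1]
    for detail in reversed(pyramid[:-1]):
        current = [u + d for u, d in zip(upsample(current), detail)]
    return current

# Length must be divisible by 2**levels.
step_edge = [0.0, 0.0, 0.0, 4.0, 4.0, 4.0, 4.0, 4.0]
pyr = laplacian_pyramid(step_edge, levels=2)
```

The finest detail band is zero everywhere except around the step, so the edge location is available before any deep feature extraction, and `reconstruct(pyr)` recovers the original signal exactly.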

Abstract:
Early screening via colonoscopy is critical for colon cancer prevention, yet developing robust AI systems for this domain is hindered by the lack of densely annotated, long‑sequence video datasets. Existing datasets predominantly focus on single‑class polyp detection and lack the rich spatial, temporal, and linguistic annotations required to evaluate modern Multimodal Large Language Models (MLLMs). To address this critical gap, we introduce Colon‑Bench, generated via a novel multi‑stage agentic workflow. Our pipeline seamlessly integrates temporal proposals, bounding‑box tracking, AI‑driven visual confirmation, and human‑in‑the‑loop review to scalably annotate full‑procedure videos. The resulting verified benchmark is unprecedented in scope, encompassing 528 videos, 14 distinct lesion categories (including polyps, ulcers, and bleeding), over 300,000 bounding boxes, 213,000 segmentation masks, and 133,000 words of clinical descriptions. We utilize Colon‑Bench to rigorously evaluate state‑of‑the‑art MLLMs across lesion classification, Open‑Vocabulary Video Object Segmentation (OV‑VOS), and video Visual Question Answering (VQA). The MLLM results demonstrate surprisingly high localization performance in medical domains compared to SAM‑3. Finally, we analyze common VQA errors from MLLMs to introduce a novel "colon‑skill" prompting strategy, improving zero‑shot MLLM performance by up to 9.7% across most MLLMs. The dataset and the code are available at https://abdullahamdi.com/colon‑bench .

Abstract:
Long‑term behavioral monitoring of individual animals is crucial for studying behavioral changes that occur over different time scales, especially for conservation and evolutionary biology. Computer vision methods have proven to benefit biodiversity monitoring, but automated behavior monitoring in wild populations remains challenging. This stems from the lack of datasets that cover a range of computer vision tasks necessary to extract biologically meaningful measurements of individual animals. Here, we introduce such a dataset (CHIRP) with a new method (CORVID) for individual re‑identification of wild birds. The CHIRP (Combining beHaviour, Individual Re‑identification and Postures) dataset is curated from a long‑term population of wild Siberian jays studied in Swedish Lapland, supporting re‑identification (re‑id), action recognition, 2D keypoint estimation, object detection, and instance segmentation. In addition to traditional task‑specific benchmarking, we introduce application‑specific benchmarking with biologically relevant metrics (feeding rates, co‑occurrence rates) to evaluate the performance of models in real‑world use cases. Finally, we present CORVID (COlouR‑based Video re‑ID), a novel pipeline for individual identification of birds based on the segmentation and classification of colored leg rings, a widespread approach for visual identification of individual birds. CORVID offers a probability‑based id tracking method by matching the detected combination of color rings with a database. We use application‑specific benchmarking to show that CORVID outperforms state‑of‑the‑art re‑id methods. We hope this work offers the community a blueprint for curating real‑world datasets from ethically approved biological studies to bridge the gap between computer vision research and biological applications.

Abstract:
Trajectory anomaly detection underpins applications from fraud detection to urban mobility analysis. Dense GPS methods preserve fine‑grained evidence such as abnormal speeds and short‑duration events, but their quadratic cost makes multi‑month analysis intractable; consequently, no existing approach detects anomalies over multi‑month dense GPS trajectories. The field instead relies on scalable sparse stay‑point methods that discard this evidence, forcing separate architectures for each regime and preventing knowledge transfer. We argue this bottleneck is unnecessary: human trajectories, dense or sparse, share a natural two‑dimensional cyclic structure along within‑day and across‑day axes. We therefore propose TITAnD (Trajectory Image Transformer for Anomaly Detection), which reformulates trajectory anomaly detection as a vision problem by representing trajectories as a Hyperspectral Trajectory Image (HTI): a day x time‑of‑day grid whose channels encode spatial, semantic, temporal, and kinematic information from either modality, unifying both under a single representation. Under this formulation, agent‑level detection reduces to image classification and temporal localization to semantic segmentation. To model this representation, we introduce the Cyclic Factorized Transformer (CFT), which factorizes attention along the two temporal axes, encoding the cyclic inductive bias of human routines, while reducing attention cost by orders of magnitude and enabling dense multi‑month anomaly detection for the first time. Empirically, TITAnD achieves the best AUC‑PR across sparse and dense benchmarks, surpassing vision models like UNet while being 11‑75x faster than the Transformer with comparable memory, demonstrating that vision reformulation and structure‑aware modeling are jointly essential. Code will be made public soon.
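
The day x time-of-day representation described above can be sketched as a simple binning step. This is an illustration only: it builds a single channel (mean speed per cell), whereas the paper's HTI encodes spatial, semantic, temporal, and kinematic channels. The function name, input layout, and slot count are assumptions.

```python
SECONDS_PER_DAY = 86_400

def trajectory_to_grid(points, num_days, slots_per_day):
    """Bin (timestamp_s, speed) samples into a day x time-of-day grid of mean speeds.

    Cells with no samples stay None, so sparse and dense inputs share one layout.
    """
    slot_len = SECONDS_PER_DAY / slots_per_day
    sums = [[0.0] * slots_per_day for _ in range(num_days)]
    counts = [[0] * slots_per_day for _ in range(num_days)]
    for timestamp, speed in points:
        day = int(timestamp // SECONDS_PER_DAY)                 # across-day axis
        slot = int((timestamp % SECONDS_PER_DAY) // slot_len)   # within-day axis
        if 0 <= day < num_days:
            sums[day][slot] += speed
            counts[day][slot] += 1
    return [[sums[d][s] / counts[d][s] if counts[d][s] else None
             for s in range(slots_per_day)] for d in range(num_days)]

# Toy samples: two readings in hour 1 of day 0, one reading at noon on day 1.
samples = [(3600, 1.0), (5400, 3.0), (86_400 + 43_200, 10.0)]
grid = trajectory_to_grid(samples, num_days=2, slots_per_day=24)
```

Once trajectories are laid out this way, rows correspond to days and columns to times of day, so labeling anomalous cells becomes a per-pixel prediction problem of the kind semantic segmentation models solve.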

Abstract:
Transformation‑based adversarial attacks (TAAs) demonstrate strong transferability when deceiving classification models. However, existing TAAs often perform unsatisfactorily or even fail when applied to structured tasks such as semantic segmentation and object detection. Encouragingly, recent studies that categorize transformations into non‑spatial and spatial transformations inspire us to address this challenge. We find that for non‑structured tasks, labels are spatially non‑structured, and thus TAAs are not required to adjust labels when applying spatial transformations. In contrast, for structured tasks, labels are spatially structured, and failing to transform labels synchronously with inputs can cause spatial misalignment and yield erroneous gradients. To address these issues, we propose a novel unified Spatial Alignment Framework (SAF) for highly transferable TAAs on spatially structured tasks, where the TAAs spatially transform labels synchronously with the input using the proposed Spatial Alignment (SA) algorithm. Extensive experiments demonstrate the crucial role of our SAF for TAAs on structured tasks. Specifically, in non‑targeted attacks, our SAF degrades the average mIoU on Cityscapes from 24.50 to 11.34, and on Kvasir‑SEG from 49.91 to 31.80, while reducing the average mAP of COCO from 17.89 to 5.25.
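
The spatial misalignment problem described above can be illustrated with a minimal sketch that applies the same horizontal flip to an input and to its dense label. It shows only the general idea of transforming labels synchronously with inputs; it is not the paper's Spatial Alignment (SA) algorithm, and the function names and toy arrays are hypothetical.

```python
def hflip(grid):
    """Mirror a 2D grid (list of rows) left to right."""
    return [row[::-1] for row in grid]

def spatially_aligned_flip(image, label):
    """Apply the same horizontal flip to the input and to its dense label."""
    return hflip(image), hflip(label)

image = [[10, 20, 200], [10, 20, 210]]   # toy intensities; bright object on the right
label = [[0, 0, 1], [0, 0, 1]]           # per-pixel class map for the same image
flipped_image, flipped_label = spatially_aligned_flip(image, label)
```

After the flip the object sits in the left column of both arrays. Had only the image been flipped, the label would still mark the right column, and a loss computed against it would produce gradients for the wrong pixels. For image-level classification labels no such adjustment is needed, because the label has no spatial layout.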

Abstract:
Recent advances in self‑supervised learning (SSL) for point clouds have substantially improved 3D scene understanding without human annotations. Existing approaches emphasize semantic awareness by enforcing feature consistency across augmented views or by masked scene modeling. However, the resulting representations transfer poorly to instance localization, and often require full finetuning for strong performance. Instance awareness is a fundamental component of 3D perception, thus bridging this gap is crucial for progressing toward true 3D foundation models that support all downstream tasks on 3D data. In this work, we introduce PointINS, an instance‑oriented self‑supervised framework that enriches point cloud representations through geometry‑aware learning. PointINS employs an orthogonal offset branch to jointly learn high‑level semantic understanding and geometric reasoning, yielding instance awareness. We identify two consistent properties essential for robust instance localization and formulate them as complementary regularization strategies, Offset Distribution Regularization (ODR), which aligns predicted offsets with empirically observed geometric priors, and Spatial Clustering Regularization (SCR), which enforces local coherence by regularizing offsets with pseudo‑instance masks. Through extensive experiments across five datasets, PointINS achieves on average +3.5% mAP improvement for indoor instance segmentation and +4.1% PQ gain for outdoor panoptic segmentation, paving the way for scalable 3D foundation models.

Abstract:
Panoramic semantic segmentation is pivotal for comprehensive 360° scene understanding in critical applications like autonomous driving and virtual reality. However, progress in this domain is constrained by two key challenges: the severe geometric distortions inherent in panoramic projections and the prohibitive cost of dense annotation. While Unsupervised Domain Adaptation (UDA) from label‑rich pinhole‑camera datasets offers a viable alternative, many real‑world tasks impose a stricter source‑free (SFUDA) constraint where source data is inaccessible for privacy or proprietary reasons. This constraint significantly amplifies the core problems of domain shift, leading to unreliable pseudo‑labels and dramatic performance degradation, particularly for minority classes. To overcome these limitations, we propose the DAPASS framework. DAPASS introduces two synergistic modules to robustly transfer knowledge without source data. First, our Panoramic Confidence‑Guided Denoising (PCGD) module generates high‑fidelity, class‑balanced pseudo‑labels by enforcing perturbation consistency and incorporating neighborhood‑level confidence to filter noise. Second, a Contextual Resolution Adversarial Module (CRAM) explicitly addresses scale variance and distortion by adversarially aligning fine‑grained details from high‑resolution crops with global semantics from low‑resolution contexts. DAPASS achieves state‑of‑the‑art performance on outdoor (Cityscapes‑to‑DensePASS) and indoor (Stanford2D3D) benchmarks, yielding 55.04% (+2.05%) and 70.38% (+1.54%) mIoU, respectively.

Abstract:
The learning order of semantic classes significantly impacts unsupervised domain adaptation for semantic segmentation, especially under adverse weather conditions. Most existing curricula rely on handcrafted heuristics (e.g., fixed uncertainty metrics) and follow a static schedule, which fails to adapt to a model's evolving, high‑dimensional training dynamics, leading to category bias. Inspired by Reinforcement Learning, we cast curriculum learning as a sequential decision problem and propose an autonomous class scheduler. This scheduler consists of two components: (i) a high‑dimensional state encoder that maps the model's training status into a latent space and distills key features indicative of progress, and (ii) a category‑fair policy‑gradient objective that ensures balanced improvement across classes. Coupled with mixed source‑target supervision, the learned class rankings direct the network's focus to the most informative classes at each stage, enabling more adaptive and dynamic learning. Our method achieves state‑of‑the‑art performance on three widely used benchmarks (ACDC, Dark Zurich, and Nighttime Driving) and shows generalization ability in synthetic‑to‑real semantic segmentation.

Abstract:
Existing real‑world super‑resolution (RSR) methods based on generative priors have achieved remarkable progress in producing high‑quality and globally consistent reconstructions. However, they often struggle to recover fine‑grained details of diverse object instances in complex real‑world scenes. This limitation primarily arises because commonly adopted denoising losses (e.g., MSE) inherently favor global consistency while neglecting instance‑level perception and restoration. To address this issue, we propose InstanceRSR, a novel RSR framework that jointly models semantic information and introduces instance‑level feature alignment. Specifically, we employ low‑resolution (LR) images as global consistency guidance while jointly modeling image data and semantic segmentation maps to enforce semantic relevance during sampling. Moreover, we design an instance representation learning module to align the diffusion latent space with the instance latent space, enabling instance‑aware feature alignment, and further incorporate a scale alignment mechanism to enhance fine‑grained perception and detail recovery. Benefiting from these designs, our approach not only generates photorealistic details but also preserves semantic consistency at the instance level. Extensive experiments on multiple real‑world benchmarks demonstrate that InstanceRSR significantly outperforms existing methods in both quantitative metrics and visual quality, achieving new state‑of‑the‑art (SOTA) performance.

Abstract:
Underwater Video Object Segmentation (VOS) is essential for marine exploration, yet open‑air methods suffer significant degradation due to color distortion, low contrast, and prevalent camouflage. A primary hurdle is the lack of high‑quality training data. To bridge this gap, we introduce UW‑VOS, the first large‑scale underwater VOS benchmark comprising 1,431 video sequences across 409 categories with 309,295 mask annotations, constructed via a semi‑automatic data engine with rigorous human verification. We further propose SAM‑U, a parameter‑efficient framework that adapts SAM2 to the underwater domain. By inserting lightweight adapters into the image encoder, SAM‑U achieves state‑of‑the‑art performance with only ~2% trainable parameters. Extensive experiments reveal that existing methods experience an average 13‑point $\mathcal{J}\&\mathcal{F}$ drop on UW‑VOS, while SAM‑U effectively bridges this domain gap. Detailed attribute‑based analysis further identifies small targets, camouflage, and exit‑re‑entry as critical bottlenecks, providing a roadmap for future research in robust underwater perception.

Abstract:
This technical report explores the MOSEv2 track of the PVUW 2026 Challenge, which targets complex semi‑supervised video object segmentation. Built on SAM 3, we develop an automatic re‑prompting framework to improve robustness under target disappearance and reappearance, severe transformation, and strong same‑category distractors. Our method first applies the SAM 3 detector to later frames to identify same‑category object candidates, and then performs DINOv3‑based object‑level matching with a transformation‑aware target feature pool to retrieve reliable target anchors. These anchors are injected back into the SAM 3 tracker together with the first‑frame mask, enabling multi‑anchor propagation rather than relying solely on the initial prompt. This simple design directly benefits several core challenges of MOSEv2. Our solution achieves a J&F of 51.17% on the test set, ranking 3rd in the MOSEv2 track.

Abstract:
Audio‑based Referring Video Object Segmentation (ARVOS) requires grounding audio queries into pixel‑level object masks over time, posing challenges in bridging acoustic signals with spatio‑temporal visual representations. In this report, we present VIRST‑Audio, a practical framework built upon a pretrained RVOS model integrated with a vision‑language architecture. Instead of relying on audio‑specific training, we convert input audio into text using an ASR module and perform segmentation using text‑based supervision, enabling effective transfer from text‑based reasoning to audio‑driven scenarios. To improve robustness, we further incorporate an existence‑aware gating mechanism that estimates whether the referred target object is present in the video and suppresses predictions when it is absent, reducing hallucinated masks and stabilizing segmentation behavior. We evaluate our approach on the MeViS‑Audio track of the 5th PVUW Challenge, where VIRST‑Audio achieves 3rd place, demonstrating strong generalization and reliable performance in audio‑based referring video segmentation.

Abstract:
Sidewalk width is an important indicator of pedestrian accessibility, comfort, and network quality, yet large‑scale width data remain scarce in most cities. Existing approaches typically rely on costly field surveys, high‑resolution overhead imagery, or simplified geometric assumptions that limit scalability or introduce systematic error. To address this gap, we present UrbanVGGT, a measurement pipeline for estimating metric sidewalk width from a single street‑view image. The method combines semantic segmentation, feed‑forward 3D reconstruction, adaptive ground‑plane fitting, camera‑height‑based scale calibration, and directional width measurement on the recovered plane. On a ground‑truth benchmark from Washington, D.C., UrbanVGGT achieves a mean absolute error of 0.252 m, with 95.5% of estimates within 0.50 m of the reference width. Ablation experiments show that metric scale calibration is the most critical component, and controlled comparisons with alternative geometry backbones support the effectiveness of the overall design. As a feasibility demonstration, we further apply the pipeline to three cities and generate SV‑SideWidth, a prototype sidewalk‑width dataset covering 527 OpenStreetMap street segments. The results indicate that street‑view imagery can support scalable generation of candidate sidewalk‑width attributes, while broader cross‑city validation and local ground‑truth auditing remain necessary before deployment as authoritative planning data.
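
The camera‑height‑based scale calibration step named above can be illustrated with a minimal sketch. It assumes (the abstract does not say this) that the reconstruction places the camera at the origin and that the fitted ground plane is given as coefficients (a, b, c, d) of a*x + b*y + c*z + d = 0 in unscaled reconstruction units. The function names and the 2.5 m camera height are hypothetical and are not taken from the paper.

```python
import math


def metric_scale(plane, camera_height_m):
    """Return metres per reconstruction unit.

    Assumption: the camera sits at the origin, so its distance to the
    plane a*x + b*y + c*z + d = 0 is |d| / ||(a, b, c)||.
    """
    a, b, c, d = plane
    norm = math.sqrt(a * a + b * b + c * c)
    if norm == 0.0:
        raise ValueError("plane normal must be non-zero")
    estimated_height = abs(d) / norm
    if estimated_height == 0.0:
        raise ValueError("camera lies on the fitted plane; scale is undefined")
    return camera_height_m / estimated_height


def to_metres(length_units, plane, camera_height_m):
    """Convert a length measured on the recovered plane to metres."""
    return length_units * metric_scale(plane, camera_height_m)


# Illustrative values only: a plane 0.5 units below the camera and an
# assumed 2.5 m camera height give 5 metres per reconstruction unit.
example_plane = (0.0, 2.0, 0.0, -1.0)
example_scale = metric_scale(example_plane, 2.5)
example_width = to_metres(0.4, example_plane, 2.5)
```

Because every metric width is multiplied by this one factor, an error in the assumed camera height propagates proportionally to all measurements, which is consistent with the ablation finding that scale calibration is the most critical component.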

Abstract:
Semantic segmentation metrics for 3D point clouds, such as mean Intersection over Union (mIoU) and Overall Accuracy (OA), present two key limitations in the context of aerial LiDAR data. First, they treat all misclassifications equally regardless of their spatial context, overlooking cases where the geometric severity of errors directly impacts the quality of derived geospatial products such as Digital Terrain Models. Second, they are often dominated by the large proportion of easily classified points, which can mask meaningful differences between models and under‑represent performance in challenging regions. To address these limitations, we propose a novel evaluation framework for comparing semantic segmentation models through two complementary approaches. First, we introduce distance‑based metrics that account for the spatial deviation between each misclassified point and the nearest ground‑truth point of the predicted class, capturing the geometric severity of errors. Second, we propose a focused evaluation on a common subset of hard points, defined as the points misclassified by at least one of the evaluated models, thereby reducing the bias introduced by easily classified points and better revealing differences in model performance in challenging regions. We validate our framework by comparing three state‑of‑the‑art deep learning models on three aerial LiDAR datasets. Results demonstrate that the proposed metrics provide complementary information to traditional measures, revealing spatial error patterns that are critical for Earth Observation applications but invisible to conventional evaluation approaches. The proposed framework enables more informed model selection for scenarios where spatial consistency is critical.
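
The two evaluation ideas described above can be sketched in a few lines. This is a minimal brute‑force illustration, not the authors' implementation: the function names are hypothetical, the Euclidean distance is an assumption, and a real aerial LiDAR tile would need a spatial index rather than a linear scan.

```python
import math


def distance_errors(points, y_true, y_pred):
    """Map each misclassified point index to the distance between that point
    and the nearest ground-truth point of the class that was predicted."""
    by_class = {}
    for point, label in zip(points, y_true):
        by_class.setdefault(label, []).append(point)

    errors = {}
    for i, (point, true_label, pred_label) in enumerate(zip(points, y_true, y_pred)):
        if true_label == pred_label:
            continue
        candidates = by_class.get(pred_label, [])
        # If the predicted class never occurs in the ground truth,
        # the deviation is reported as infinite.
        errors[i] = min((math.dist(point, c) for c in candidates), default=math.inf)
    return errors


def hard_point_indices(y_true, predictions):
    """Indices of points misclassified by at least one evaluated model."""
    return [
        i for i, label in enumerate(y_true)
        if any(pred[i] != label for pred in predictions)
    ]


points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0), (6.0, 0.0, 0.0)]
y_true = ["ground", "ground", "building", "building"]
model_a = ["ground", "building", "building", "building"]
model_b = ["ground", "ground", "building", "ground"]

errors_a = distance_errors(points, y_true, model_a)
hard = hard_point_indices(y_true, [model_a, model_b])
```

In this toy example both models make exactly one error, so they look alike on accuracy, while the distance values separate them by how far each error lies from the nearest true point of the predicted class.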

Abstract:
Accurate land cover mapping in riverine environments is essential for effective river management, ecological understanding, and geomorphic change monitoring. This study explores the use of Point Transformer v2 (PTv2), an advanced deep neural network architecture designed for point cloud data, for land cover mapping through semantic segmentation of multispectral LiDAR data in real‑world riverine environments. We utilize the geometric and spectral information from the 3‑channel LiDAR point cloud to map land cover classes, including sand, gravel, low vegetation, high vegetation, forest floor, and water. The PTv2 model was trained and evaluated on point cloud data from the Oulanka river in northern Finland using both geometry and spectral features. To improve the model's generalization in new riverine environments, we additionally investigate multi‑dataset training that adds sparsely annotated data from an additional river dataset. Results demonstrated that the full‑feature configuration achieved a mean Intersection over Union (mIoU) of 0.950, significantly outperforming the geometry baseline. Other ablation studies revealed that intensity and reflectance features were key to accurate land cover mapping. The multi‑dataset training experiment showed improved generalization performance, suggesting potential for developing more robust models despite limited high‑quality annotated data. Our work demonstrates the potential of applying transformer‑based architectures to multispectral point clouds in riverine environments. The approach offers new capabilities for monitoring sediment transport and other river management applications.

Abstract:
Recent advances in deep learning have significantly improved 3D semantic segmentation, but most models focus on indoor or terrestrial datasets. Their behavior under real aerial acquisition conditions remains insufficiently explored, and although a few studies have addressed similar scenarios, they differ in dataset design, acquisition conditions, and model selection. To address this gap, we conduct an experimental benchmark evaluating several state‑of‑the‑art architectures on a large‑scale aerial LiDAR dataset acquired under operational flight conditions in Navarre, Spain, covering heterogeneous urban, rural, and industrial landscapes. This study compares four representative deep learning models, including KPConv, RandLA‑Net, Superpoint Transformer, and Point Transformer V3, across five semantic classes commonly found in airborne surveys, such as ground, vegetation, buildings, and vehicles, highlighting the inherent challenges of class imbalance and geometric variability in aerial data. Results show that all tested models achieve high overall accuracy exceeding 93%, with KPConv attaining the highest mean IoU (78.51%) through consistent performance across classes, particularly on challenging and underrepresented categories. Point Transformer V3 demonstrates superior performance on the underrepresented vehicle class (75.11% IoU), while Superpoint Transformer and RandLA‑Net trade off segmentation robustness for computational efficiency.

Abstract:
Audio‑Visual Semantic Segmentation (AVSS) aligns audio and video at the pixel level but requires costly per‑frame annotations. We introduce Weakly Supervised Audio‑Visual Semantic Segmentation (WSAVSS), which uses only video‑level labels to generate per‑frame semantic masks of sounding objects. We decompose WSAVSS into looking, listening, and segmentation, and propose Progressive Cross‑modal Alignment for Semantics (PCAS) with two modules: Looking‑before‑Listening and Listening‑before‑Segmentation. PCAS builds a classification task to train the audio‑visual encoder using video labels, injects visual semantic prompts to enhance frame‑level audio understanding, and then applies progressive contrastive alignment to map audio categories to image regions without mask annotations. Experiments show PCAS achieves state‑of‑the‑art performance among weakly supervised methods on AVS and remains competitive with fully supervised baselines on AVSS, validating its effectiveness.

Abstract:
We present CataractSAM‑2, a domain‑adapted extension of Meta's Segment Anything Model 2, designed for real‑time semantic segmentation of cataract ophthalmic surgery videos with high accuracy. Positioned at the intersection of computer vision and medical robotics, CataractSAM‑2 enables precise intraoperative perception crucial for robotic‑assisted and computer‑guided surgical systems. Furthermore, to alleviate the burden of manual labeling, we introduce an interactive annotation framework that combines sparse prompts with video‑based mask propagation. This tool significantly reduces annotation time and facilitates the scalable creation of high‑quality ground‑truth masks, accelerating dataset development for ocular anterior segment surgeries. We also demonstrate the model's strong zero‑shot generalization to glaucoma trabeculectomy procedures, confirming its cross‑procedural utility and potential for broader surgical applications. The trained model and annotation toolkit are released as open‑source resources, establishing CataractSAM‑2 as a foundation for expanding anterior ophthalmic surgical datasets and advancing real‑time AI‑driven solutions in medical robotics, as well as surgical video understanding.

Abstract:
Training‑free open‑vocabulary semantic segmentation (OVSS) promises rapid adaptation to new label sets without retraining. Yet, many methods rely on heavy post‑processing or handle text and vision in isolation, leaving cross‑modal geometry underutilized. Others introduce auxiliary vision backbones or multi‑model pipelines, which increase complexity and latency while compromising design simplicity. We present PEARL (Procrustes alignment with text‑aware Laplacian propagation), a compact two‑step inference that follows an align‑then‑propagate principle. The Procrustes alignment step performs an orthogonal projection inside the last self‑attention block, rotating keys toward the query subspace via a stable polar iteration. The text‑aware Laplacian propagation then refines per‑pixel logits on a small grid through a confidence‑weighted, text‑guided graph solve: text provides both a data‑trust signal and neighbor gating, while image gradients preserve boundaries. Our method is fully training‑free, plug‑and‑play, and uses only fixed constants, adding minimal latency with a small per‑head projection and a few conjugate‑gradient steps. PEARL sets a new state‑of‑the‑art in training‑free OVSS without extra data or auxiliary backbones across standard benchmarks, achieving superior performance under both with‑background and without‑background protocols.

Abstract:
We present a comparative study of methods for generating realistic, constrained small‑ to medium‑scale road networks with built‑in redundancy. In this research, we evaluate the proposed Evolutionary Algorithm (EA) with connectivity and redundancy constraints against the Wave Function Collapse (WFC) method ‑ commonly used in procedural terrain generation for games ‑ and swarm algorithms: Particle Swarm (PSO) and Gray Wolf (GWO). Our focus is on producing realistic, redundant road networks suitable for vision, localization and navigation problems. We evaluate the following metrics: connectivity, cycles, intersections, dead ends, and graph cut‑edges, while enforcing physical plausibility. We propose an EA and its extended version with elitism via the MAP‑Elites method. We detail the implementation, constraints, and metrics, and provide both visual and quantitative comparisons with baselines. Results show how fitness function design choices affect the structural characteristics of generated networks and highlight the impact of specific constraints in practical applications. Our contribution is a method for creating realistic synthetic datasets from sparse tile definitions derived from real‑world data. We demonstrate a practical application by generating realistic maps using a laboratory‑collected tileset from a Duckietown city model. Our approach performs coherent geometric transformations on metadata, in this work exemplified by semantic segmentation masks of the generated road networks.

Abstract:
Robust semantic segmentation is crucial for safe autonomous driving, yet deployed models remain vulnerable to black‑box adversarial attacks when target weights are unknown. Most existing approaches either craft image‑wide perturbations or optimize patches for a single architecture, which limits their practicality and transferability. We introduce OmniPatch, a training framework for learning a universal adversarial patch that generalizes across images and both ViT and CNN architectures without requiring access to target model parameters.

Abstract:
We present an open‑source robotic framework that integrates computer vision and machine learning based inverse kinematics to enable low‑cost laboratory automation tasks such as colony picking and liquid handling. The system uses a custom‑trained U‑Net model for semantic segmentation of microbial cultures, combined with a Mixture Density Network for predicting the joint angles of a simple 5‑DOF robot arm. We evaluated the framework using a modified robot arm, upgraded with a custom liquid handling end‑effector. Experimental results demonstrate the framework's feasibility for precise, repeatable operations, with a mean positional error below 1 mm, joint angle prediction errors below 4 degrees, and colony detection with an IoU score of 0.537 and a Dice coefficient of 0.596.
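
For readers unfamiliar with the two segmentation scores quoted above, the sketch below computes them for a single pair of binary masks using the standard definitions IoU = |A ∩ B| / |A ∪ B| and Dice = 2|A ∩ B| / (|A| + |B|). The abstract does not say how the reported scores were computed or averaged, so the function name, the flat‑list mask layout, and the empty‑mask convention here are assumptions.

```python
def iou_and_dice(pred, target):
    """Return (IoU, Dice) for two equally long flat binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection
    if union == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0, 1.0
    iou = intersection / union
    dice = 2.0 * intersection / (pred_area + target_area)
    return iou, dice


# One overlapping pixel, three pixels in the union.
example_iou, example_dice = iou_and_dice([1, 1, 0, 0], [1, 0, 1, 0])
```

For a single mask pair the two scores are tied by Dice = 2 IoU / (1 + IoU), so Dice is never lower than IoU; scores averaged over many images need not satisfy that identity exactly.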

Abstract:
3D instance segmentation methods typically rely on high‑quality point clouds or posed RGB‑D scans, requiring complex multi‑stage processing pipelines, and are highly sensitive to reconstruction noise. While recent feed‑forward transformers have revolutionized multi‑view 3D reconstruction, they remain decoupled from high‑level semantic understanding. In this work, we present SegVGGT, a unified end‑to‑end framework that simultaneously performs feed‑forward 3D reconstruction and instance segmentation directly from multi‑view RGB images. By introducing object queries that interact with multi‑level geometric features, our method deeply integrates instance identification into the visual geometry grounded transformer. To address the severe attention dispersion problem caused by the massive number of global image tokens, we propose the Frame‑level Attention Distribution Alignment (FADA) strategy. FADA explicitly guides object queries to attend to instance‑relevant frames during training, providing structured supervision without extra inference overhead. Extensive experiments demonstrate that SegVGGT achieves state‑of‑the‑art performance on ScanNetv2 and ScanNet200, outperforming both recent joint models and RGB‑D‑based approaches, while exhibiting strong generalization capabilities on ScanNet++.

Abstract:
Deep learning underlies most modern approaches and tools in computer vision, including biomedical imaging. However, for interactive semantic segmentation (often called pixel classification in this context) and interactive object‑level classification (object classification), feature‑based shallow learning remains widely used. This is due to the diversity of data in this domain, the lack of large pretraining datasets, and the need for computational and label efficiency. In contrast, state‑of‑the‑art tools for many other vision tasks in microscopy ‑ most notably cellular instance segmentation ‑ already rely on deep learning and have recently benefited substantially from vision foundation models (VFMs), particularly SAM. Here, we investigate whether VFMs can also improve pixel and object classification compared to current approaches. To this end, we evaluate several VFMs, including general‑purpose models (SAM, SAM2, DINOv3) and domain‑specific ones (μSAM, PathoSAM), in combination with shallow learning and attentive probing on five diverse and challenging datasets. Our results demonstrate consistent improvements over hand‑crafted features and provide a clear pathway toward practical improvements. Furthermore, our study establishes a benchmark for VFMs in microscopy and informs future developments in this area.

Abstract:
Open‑Vocabulary Semantic Segmentation (OVSS) assigns pixel‑level labels from an open set of text‑defined categories, demanding reliable generalization to unseen classes at inference. Although modern vision‑language models (VLMs) support strong open‑vocabulary recognition, their representations learned through global contrastive objectives remain suboptimal for dense prediction, prompting many OVSS methods to depend on limited adaptation or refinement of image‑text similarity maps. This, in turn, restricts spatial precision and robustness in complex, cluttered scenes. We introduce dinov3.seg, extending dinov3.txt into a dedicated framework for OVSS. Our contributions are four‑fold. First, we design a task‑specific architecture tailored to this backbone, systematically adapting established design principles from prior open‑vocabulary segmentation work. Second, we jointly leverage text embeddings aligned with both the global [CLS] token and local patch‑level visual features from a ViT‑based encoder, effectively combining semantic discrimination with fine‑grained spatial locality. Third, unlike prior approaches that rely primarily on post hoc similarity refinement, we perform early refinement of visual representations prior to image‑text interaction, followed by late refinement of the resulting image‑text correlation features, enabling more accurate and robust dense predictions in cluttered scenes. Finally, we propose a high‑resolution local‑global inference strategy based on sliding‑window aggregation, which preserves spatial detail while maintaining global context. We conduct extensive experiments on five widely adopted OVSS benchmarks to evaluate our approach. The results demonstrate its effectiveness and robustness, consistently outperforming current state‑of‑the‑art methods.
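
Sliding‑window aggregation, the basis of the inference strategy mentioned above, can be sketched in one dimension. This is a simplified illustration under stated assumptions and not the dinov3.seg procedure: it averages overlapping window outputs uniformly, works on a 1D sequence instead of an image, and takes a hypothetical `predict` callable that returns one value per input position.

```python
def window_starts(length, window, stride):
    """Start offsets that cover the full sequence, including its tail."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if window >= length:
        return [0]
    starts = list(range(0, length - window + 1, stride))
    if starts[-1] != length - window:
        # Add a final window flush with the end so no position is skipped.
        starts.append(length - window)
    return starts


def sliding_window_average(values, window, stride, predict):
    """Run predict on each window and average where windows overlap."""
    n = len(values)
    totals = [0.0] * n
    counts = [0] * n
    for start in window_starts(n, window, stride):
        outputs = predict(values[start:start + window])
        for offset, value in enumerate(outputs):
            totals[start + offset] += value
            counts[start + offset] += 1
    return [total / count for total, count in zip(totals, counts)]


# With an identity predictor the aggregated output reproduces the input.
aggregated = sliding_window_average([1, 2, 3, 4, 5], 3, 2, lambda w: w)
```

In a 2D setting the same bookkeeping applies per pixel and per class logit, with the window moved along both axes.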

Abstract:
Large Vision Language Models (LVLMs) excel at semantic understanding but struggle with fine‑grained spatial grounding, as the model must implicitly infer complex geometry without ever producing a spatial interpretation. We present Perceptio, a perception‑enhanced LVLM with 2D and 3D spatial reasoning abilities, enabled via explicit semantic segmentation tokens and depth tokens generated directly within the autoregressive sequence. Concretely, we (i) distill a VQ‑VAE depth codebook from a strong monocular teacher to tokenize dense depth into compact sequences, and (ii) integrate SAM2‑based semantic segmentation tokens and VQ‑VAE depth tokens inside the LLM so the model first emits spatial tokens and then answers. To stabilize depth token generation, we introduce novel composite depth‑token objectives (marker, token, and count losses) and a soft‑merging technique for differentiable reconstruction. We adopt a multi‑task co‑training strategy across diverse datasets, letting the model learn perception tokens to tackle multiple downstream tasks. Building on InternVL, Perceptio achieves state‑of‑the‑art performance across benchmarks: improving referring expression segmentation by +0.8/+1.4/+1.1 cIoU on RefCOCO/+/g, HardBLINK spatial understanding accuracy by 10.3%, and MMBench accuracy by 1.0%, demonstrating that explicit spatial chain‑of‑thought materially strengthens spatial grounding in LVLMs.

Abstract:
Navigation and mapping on the lunar surface require robust perception under challenging conditions, including poorly textured environments, high‑contrast lighting, and limited computational resources. This paper presents a real‑time mapping framework that integrates dense perception models with a 3D Gaussian Splatting (3DGS) representation. We first benchmark several models on synthetic datasets generated with the LuPNT simulator, selecting a stereo dense depth estimation model based on Gated Recurrent Units for its balance of speed and accuracy in depth estimation, and a convolutional neural network for its superior performance in detecting semantic segments. Using ground truth poses to decouple the local scene understanding from the global state estimation, our pipeline reconstructs a 120‑meter traverse with a geometric height accuracy of approximately 3 cm, outperforming a traditional point cloud baseline without LiDAR. The resulting 3DGS map enables novel view synthesis and serves as a foundation for a full SLAM system, where its capacity for joint map and pose optimization would offer significant advantages. Our results demonstrate that combining semantic segmentation and dense depth estimation with learned map representations is an effective approach for creating detailed, large‑scale maps to support future lunar surface missions.

Abstract:
Token pruning is essential for enhancing the computational efficiency of vision‑language models (VLMs), particularly for video‑based tasks where temporal redundancy is prevalent. Prior approaches typically prune tokens either (1) within the vision transformer (ViT) exclusively for unimodal perception tasks such as action recognition and object segmentation, without adapting to downstream vision‑language tasks; or (2) only within the LLM while leaving the ViT output intact, often requiring complex text‑conditioned token selection mechanisms. In this paper, we introduce Spatio‑Temporal Token Scoring (STTS), a simple and lightweight module that prunes vision tokens across both the ViT and the LLM without text conditioning or token merging, and is fully compatible with end‑to‑end training. By learning how to score temporally via an auxiliary loss and spatially via LLM downstream gradients, aided by our efficient packing algorithm, STTS prunes 50% of vision tokens throughout the entire architecture, resulting in a 62% improvement in efficiency during both training and inference with only a 0.7% drop in average performance across 13 short and long video QA tasks. Efficiency gains increase with more sampled frames per video. Applying test‑time scaling for long‑video QA further yields performance gains of 0.5‑1% compared to the baseline. Overall, STTS represents a novel, simple yet effective technique for unified, architecture‑wide vision token pruning.
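
The pruning step itself reduces to keeping the highest‑scoring fraction of tokens. The sketch below shows only that selection, assuming the scores are already available; in STTS the scores are learned, which this example does not model. The function name and `keep_ratio` parameter are hypothetical.

```python
import math


def prune_tokens(tokens, scores, keep_ratio=0.5):
    """Keep the top-scoring fraction of tokens, preserving original order."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if not tokens:
        return []
    keep = max(1, math.ceil(len(tokens) * keep_ratio))
    # Rank indices by score, then restore positional order for the survivors
    # so that downstream layers still see tokens in sequence order.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep])
    return [tokens[i] for i in kept]


# Illustrative scores: the two highest-scoring of four tokens survive.
survivors = prune_tokens(["a", "b", "c", "d"], [0.1, 0.9, 0.4, 0.8])
```

Restoring positional order after selection matters for video input, where token order encodes both spatial layout and frame sequence.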

Abstract:
Tracking Any Point (TAP) has emerged as a fundamental tool for video understanding. Current approaches adapt Vision Foundation Models (VFMs) like DINOv2 via offline finetuning or test‑time optimization. However, these VFMs rely on static image pre‑training, which is inherently sub‑optimal for capturing dense temporal correspondence in videos. To address this, we propose Mask‑to‑Point (M2P) learning, which leverages rich video object segmentation (VOS) mask annotations to improve VFMs for dense point tracking. Our M2P introduces three new mask‑based constraints for weakly‑supervised representation learning. First, we propose a local structure consistency loss, which leverages Procrustes analysis to model the cohesive motion of points lying within a local structure, achieving more reliable point‑to‑point matching learning. Second, we propose a mask label consistency (MLC) loss, which enforces that sampled foreground points strictly match foreground regions across frames. The proposed MLC loss can be regarded as a regularization, which stabilizes training and prevents convergence to trivial solutions. Finally, a mask boundary constraint is applied to explicitly supervise boundary points. We show that our weakly‑supervised M2P models significantly outperform baseline VFMs with efficient training by using only 3.6K VOS training videos. Notably, M2P achieves 12.8% and 14.6% performance gains over DINOv2‑B/14 and DINOv3‑B/16 on the TAP‑Vid‑DAVIS benchmark, respectively. Moreover, the proposed M2P models are used as pre‑trained backbones for both test‑time optimized and offline fine‑tuned TAP tasks, demonstrating their potential to serve as general pre‑trained models for point tracking. Code will be made publicly available upon acceptance.

Abstract:
Multi‑view learning primarily aims to fuse multiple features to describe data comprehensively. Most prior studies implicitly assume that different views share similar dimensions. In practice, however, severe dimensional disparities often exist among different views, leading to the unbalanced multi‑view learning issue. For example, in emotion recognition tasks, video frames often reach dimensions of 10^6, while physiological signals comprise only 10^1 dimensions. Existing methods typically face two main challenges for this problem: (1) They often bias towards high‑dimensional data, overlooking the low‑dimensional views. (2) They struggle to effectively align representations under extreme dimensional imbalance, which introduces severe redundancy into the low‑dimensional ones. To address these issues, we propose the Adaptive Multi‑view Sparsity Learning (AdaMuS) framework. First, to prevent ignoring the information of low‑dimensional views, we construct view‑specific encoders to map them into a unified dimensional space. Given that mapping low‑dimensional data to a high‑dimensional space often causes severe overfitting, we design a parameter‑free pruning method to adaptively remove redundant parameters in the encoders. Furthermore, we propose a sparse fusion paradigm that flexibly suppresses redundant dimensions and effectively aligns each view. Additionally, to learn representations with stronger generalization, we propose a self‑supervised learning paradigm that obtains supervision information by constructing similarity graphs. Extensive evaluations on a synthetic toy dataset and seven real‑world benchmarks demonstrate that AdaMuS consistently achieves superior performance and exhibits strong generalization across both classification and semantic segmentation tasks.

Abstract:
Symmetry under rigid motion is one of the salient factors in efficient learning of 3D point cloud problems. Group convolution has been a representative method to extract equivariant features, but its realizations have struggled to retain both rigorous symmetry and scalability simultaneously. We advocate utilizing the intertwiner framework to resolve this trade‑off, but previous works on it, which did not achieve complete SE(3) symmetry or scalability to large‑scale problems, necessitate a more advanced kernel architecture. We present Equivariant Coordinate‑based Kernel Convolution, or ECKConv. It acquires SE(3) equivariance from the kernel domain defined in a double coset space, and its explicit kernel design using coordinate‑based networks enhances its learning capability and memory efficiency. The experiments on diverse point cloud tasks, e.g., classification, pose registration, part segmentation, and large‑scale semantic segmentation, validate the rigid equivariance, memory scalability, and outstanding performance of ECKConv compared to state‑of‑the‑art equivariant methods.

Abstract:
We present a modular, full‑stack autonomy system for lunar surface navigation and mapping developed for the Lunar Autonomy Challenge. Operating in a GNSS‑denied, visually challenging environment, our pipeline integrates semantic segmentation, stereo visual odometry, pose graph SLAM with loop closures, and layered planning and control. We leverage lightweight learning‑based perception models for real‑time segmentation and feature tracking and use a factor‑graph backend to maintain globally consistent localization. High‑level waypoint planning is designed to promote mapping coverage while encouraging frequent loop closures, and local motion planning uses arc sampling with geometric obstacle checks for efficient, reactive control. We evaluate our approach in the competition's high‑fidelity lunar simulator, demonstrating centimeter‑level localization accuracy, high‑fidelity map generation, and strong repeatability across random seeds and rock distributions. Our solution achieved first place in the final competition evaluation.

Abstract:
Mobile robots are increasingly deployed for inspection, patrol, and search‑and‑rescue operations, relying on computer vision for perception, navigation, and autonomous decision‑making. However, executing modern vision workloads onboard is challenging due to limited compute resources and strict energy constraints. While some platforms include embedded accelerators, these are typically tied to proprietary software stacks, leaving user‑defined workloads to run on resource‑constrained companion computers. We present vAccSOL, a framework for efficient and transparent execution of AI‑based vision workloads across heterogeneous robotic and edge platforms. vAccSOL integrates two components: SOL, a neural network compiler that generates optimized inference libraries with minimal runtime dependencies, and vAccel, a lightweight execution framework that transparently dispatches inference locally on the robot or to nearby edge infrastructure. This combination enables hardware‑optimized inference and flexible execution placement without requiring modifications to robot applications. We evaluate vAccSOL on a real‑world testbed with a commercial quadruped robot and twelve deep learning models covering image classification, video classification, and semantic segmentation. Compared to a PyTorch compiler baseline, SOL achieves comparable or better inference performance. With edge offloading, vAccSOL reduces robot‑side power consumption by up to 80% and edge‑side power by up to 60% compared to PyTorch, while increasing vision pipeline frame rate by up to 24x, extending the operating lifetime of battery‑powered robots.

Abstract:
Accurate semantic segmentation of 3D dental models is essential for digital dentistry applications such as orthodontics and dental implants. However, due to complex tooth arrangements and similarities in shape among adjacent teeth, existing methods struggle with accurate segmentation, because they often focus on local geometry while neglecting global contextual information. To address this, we propose TCATSeg, a novel framework that combines local geometric features with global semantic context. We introduce a set of sparse yet physically meaningful superpoints to capture global semantic relationships and enhance segmentation accuracy. Additionally, we present a new dataset of 400 dental models, including pre‑orthodontic samples, to evaluate the generalization of our method. Extensive experiments demonstrate that TCATSeg outperforms state‑of‑the‑art approaches.

Abstract:
Large Vision‑Language Models (LVLMs) achieve strong performance on many multimodal tasks, but object hallucinations severely undermine their reliability. Most existing studies focus on the text modality, attributing hallucinations to overly strong language priors and insufficient visual grounding. In contrast, we observe that abnormal attention patterns within the visual modality can also give rise to hallucinated objects. Building on this observation, we propose Segmentation‑based Attention Entropy (SAE), which leverages semantic segmentation to quantify visual attention uncertainty in an object‑level semantic space. Based on SAE, we further design a reliability score for hallucination detection and an SAE‑guided attention adjustment method that modifies visual attention at inference time to mitigate hallucinations. We evaluate our approach on public benchmarks and in real embodied multimodal scenarios with quadruped robots. Experimental results show that SAE substantially reduces object hallucinations without any additional training cost, thereby enabling more trustworthy LVLM‑driven perception and decision‑making.
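
One plausible reading of an attention entropy computed in an object-level semantic space is sketched below: patch-level attention is pooled per segment, and the Shannon entropy of the pooled distribution is taken. The abstract does not define SAE precisely, so this pooling rule, the function name, and the toy inputs are assumptions rather than the authors' formulation.

```python
import math

def segment_attention_entropy(attention, segment_ids):
    """Shannon entropy (in nats) of attention mass pooled per segment.

    attention   : non-negative attention weights, one per image patch
    segment_ids : segment label per patch (hypothetical input, e.g. the
                  output of any semantic segmentation model)
    """
    if len(attention) != len(segment_ids):
        raise ValueError("attention and segment_ids must have equal length")
    total = sum(attention)
    if total <= 0:
        raise ValueError("attention must contain positive mass")
    mass = {}
    for weight, seg in zip(attention, segment_ids):
        mass[seg] = mass.get(seg, 0.0) + weight / total
    return -sum(p * math.log(p) for p in mass.values() if p > 0)

# Attention concentrated on one object gives low entropy ...
focused = segment_attention_entropy([0.9, 0.05, 0.05, 0.0],
                                    ["dog", "dog", "dog", "grass"])
# ... while attention spread evenly over four segments gives log(4).
diffuse = segment_attention_entropy([0.25, 0.25, 0.25, 0.25],
                                    ["dog", "grass", "sky", "road"])
```

Under this reading, a higher value indicates attention scattered across many objects, which is the kind of uncertainty a reliability score could threshold.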

Abstract:
Mamba‑based vision models have advanced in recent years as alternatives to Vision Transformers (ViTs), which suffer from quadratic complexity. While the recurrent scanning mechanism of Mamba offers computational efficiency, it inherently limits non‑causal interactions between image patches. Prior works have attempted to address this limitation through various multi‑scan strategies; however, these approaches suffer from inefficiencies due to suboptimal scan designs and frequent data rearrangement. Moreover, Mamba exhibits relatively slow computational speed at the short token lengths commonly used in visual tasks. In pursuit of a truly efficient vision encoder, we rethink the scan operation for vision and the computational efficiency of Mamba. To this end, we propose SF‑Mamba, a novel visual Mamba with two key proposals: auxiliary patch swapping for encoding bidirectional information flow under a unidirectional scan and batch folding with periodic state reset for advanced GPU parallelism. Extensive experiments on image classification, object detection, and instance and semantic segmentation consistently demonstrate that our proposed SF‑Mamba significantly outperforms state‑of‑the‑art baselines while improving throughput across different model sizes. We will release the source code after publication.

Abstract:
Semantic segmentation models are widely deployed in safety‑critical applications such as autonomous driving, yet their vulnerability to backdoor attacks remains largely underexplored. Prior segmentation backdoor studies transfer threat settings from existing image classification tasks, focusing primarily on object‑to‑background mis‑segmentation. In this work, we revisit the threats by systematically examining backdoor attacks tailored to semantic segmentation. We identify four coarse‑grained attack vectors (Object‑to‑Object, Object‑to‑Background, Background‑to‑Object, and Background‑to‑Background attacks), as well as two fine‑grained vectors (Instance‑Level and Conditional attacks). To formalize these attacks, we introduce BADSEG, a unified framework that optimizes trigger designs and applies label manipulation strategies to maximize attack performance while preserving victim model utility. Extensive experiments across diverse segmentation architectures on benchmark datasets demonstrate that BADSEG achieves high attack effectiveness with minimal impact on clean samples. We further evaluate six representative defenses and find that they fail to reliably mitigate our attacks, revealing critical gaps in current defenses. Finally, we demonstrate that these vulnerabilities persist in recent emerging architectures, including transformer‑based networks and the Segment Anything Model (SAM), thereby compromising their security. Our work reveals previously overlooked security vulnerabilities in semantic segmentation, and motivates the development of defenses tailored to segmentation‑specific threat models.

Abstract:
Semi‑supervised crowd analysis is a prominent area of research, as unlabeled data are typically abundant and inexpensive to obtain. However, traditional point‑based annotations constrain performance because individual regions are inherently ambiguous, and consequently, learning fine‑grained structural semantics from sparse annotations remains an unresolved challenge. In this paper, we first propose an Exclusion‑Constrained Dual‑Prompt SAM (EDP‑SAM), based on our Nearest Neighbor Exclusion Circle (NNEC) constraint, to generate mask supervision for current datasets. With the aim of segmenting individuals in dense scenes, we then propose Exclusivity‑Guided Mask Learning (XMask), which enforces spatial separation through a discriminative mask objective. Gaussian smoothing and a differentiable center sampling strategy are utilized to improve feature continuity and training stability. Building on XMask, we present a semi‑supervised crowd counting framework that uses instance mask priors as pseudo‑labels, which contain richer shape information than traditional point cues. Extensive experiments on the ShanghaiTech A, UCF‑QNRF, and JHU++ datasets (using 5%, 10%, and 40% labeled data) verify that our end‑to‑end model achieves state‑of‑the‑art semi‑supervised segmentation and counting performance, effectively bridging the gap between counting and instance segmentation within a unified framework.

Abstract:
We introduce FEEL (Force‑Enhanced Egocentric Learning), the first large‑scale dataset pairing force measurements gathered from custom piezoresistive gloves with egocentric video. Our gloves enable scalable data collection, and FEEL contains approximately 3 million force‑synchronized frames of natural unscripted manipulation in kitchen environments, with 45% of frames involving hand‑object contact. Because force is the underlying cause that drives physical interaction, it is a critical primitive for physical action understanding. We demonstrate the utility of force for physical action understanding through application of FEEL to two families of tasks: (1) contact understanding, where we jointly perform temporal contact segmentation and pixel‑level contacted object segmentation; and (2) action representation learning, where force prediction serves as a self‑supervised pretraining objective for video backbones. We achieve state‑of‑the‑art temporal contact segmentation results and competitive pixel‑level segmentation results without any need for manual contacted object segmentation annotations. Furthermore, we demonstrate that action representation learning with FEEL improves transfer performance on action understanding tasks without any manual labels over EPIC‑Kitchens, SomethingSomething‑V2, EgoExo4D and Meccano.

Abstract:
The landscape of self‑supervised learning (SSL) is currently dominated by generative approaches (e.g., MAE) that reconstruct raw low‑level data, and predictive approaches (e.g., I‑JEPA) that predict high‑level abstract embeddings. While generative methods provide strong grounding, they are computationally inefficient for high‑redundancy modalities like imagery, and their training objective does not prioritize learning high‑level, conceptual features. Conversely, predictive methods often suffer from training instability due to their reliance on the non‑stationary targets of final‑layer self‑distillation. We introduce Bootleg, a method that bridges this divide by tasking the model with predicting latent representations from multiple hidden layers of a teacher network. This hierarchical objective forces the model to capture features at varying levels of abstraction simultaneously. We demonstrate that Bootleg significantly outperforms comparable baselines (+10% over I‑JEPA) on classification of ImageNet‑1K and iNaturalist‑21, and semantic segmentation of ADE20K and Cityscapes.

Abstract:
Masked auto‑encoders (MAE) and related approaches have shown promise for satellite imagery, but their application to synthetic aperture radar (SAR) remains limited due to challenges in semantic labeling and high noise levels. Building on our prior work with SAR‑W‑MixMAE, which adds SAR‑specific intensity‑weighted loss to standard MixMAE for pretraining, we also introduce SAR‑W‑SimMIM, a weighted variant of SimMIM applied to ALOS‑2 single‑channel SAR imagery. This method aims to reduce the impact of speckle and extreme intensity values during self‑supervised pretraining. We evaluate its effect on semantic segmentation compared to our previous trial with SAR‑W‑MixMAE and random initialization, observing notable improvements. In addition, pretraining and fine‑tuning models on satellite imagery pose unique challenges, particularly when developing region‑specific models. Imbalanced land cover distributions such as dominant water, forest, or desert areas can introduce bias, affecting both pretraining and downstream tasks like land cover segmentation. To address this, we constructed a SAR dataset using ALOS‑2 single‑channel (HH polarization) imagery focused on the Japan region, marking the initial phase toward a national‑scale foundation model. This dataset was used to pretrain a vision transformer‑based autoencoder, with the resulting encoder fine‑tuned for semantic segmentation using a task‑specific decoder. Initial results demonstrate significant performance improvements compared to training from scratch with random initialization. In summary, this work provides a guide to processing and preparing ALOS‑2 observations into a dataset that can be used for self‑supervised pretraining of models and fine‑tuning on downstream tasks such as semantic segmentation.

Abstract:
Inspection of confined infrastructure such as culverts often requires accessing hidden spaces whose entrances are reachable primarily from elevated viewpoints. Aerial‑ground cooperation enables a UAV to deploy a compact UGV for interior exploration, but selecting a suitable deployment region from aerial observations requires metric terrain reasoning involving scale ambiguity, reconstruction uncertainty, and terrain semantics. We present a metric RGB‑based geometric‑semantic reconstruction and traversability analysis framework for aerial‑to‑ground hidden space inspection. A feed‑forward multi‑view RGB reconstruction backbone produces dense geometry, while temporally consistent semantic segmentation yields a 3D semantic map. To enable deployment‑relevant measurements without LiDAR‑based dense mapping, we introduce an embodied motion prior that recovers metric scale by enforcing consistency between predicted camera motion and onboard platform egomotion. From the metrically grounded reconstruction, we construct a confidence‑aware geometric‑semantic traversability map and evaluate candidate deployment zones under explicit reachability constraints. Experiments on a tethered UAV‑UGV platform demonstrate reliable deployment‑zone identification in hidden space scenarios.

Abstract:
3D instance segmentation for laser scanning (LiDAR) point clouds remains a challenge in many remote sensing‑related domains. Successful solutions typically rely on supervised deep learning and manual annotations, and consequently focus on objects that can be well delineated through visual inspection and manual labeling of point clouds. However, for tasks with more complex and cluttered scenes, such as in‑field plant phenotyping in agriculture, such approaches are often infeasible. In this study, we tackle the task of in‑field wheat head instance segmentation directly from terrestrial laser scanning (TLS) point clouds. To address the problem and circumvent the need for manual annotations, we propose a novel two‑stage pipeline. To obtain the initial 3D instance proposals, the first stage uses 3D‑to‑2D multi‑view projections, the Grounded SAM pipeline for zero‑shot 2D object‑centric segmentation, and multi‑view label fusion. The second stage uses these initial proposals as noisy pseudo‑labels to train a supervised 3D panoptic‑style segmentation neural network. Our results demonstrate the feasibility of the proposed approach and show performance improvements relative to Wheat3DGS, a recent alternative solution for in‑field wheat head instance segmentation without manual 3D annotations based on multi‑view RGB images and 3D Gaussian Splatting, showcasing TLS as a competitive sensing alternative. Moreover, the results show that both stages of the proposed pipeline can deliver usable 3D instance segmentation without manual annotations, indicating promising, low‑effort transferability to other comparable TLS‑based point cloud segmentation tasks.

Abstract:
Referring video object segmentation (RVOS) has recently gained great popularity in computer vision due to its widespread applications. The existing RVOS setting relies on elaborately trimmed videos in which text‑referred objects always appear in all frames, which fails to fully reflect the realistic challenges of this task. This simplified setting requires RVOS methods only to predict where objects are, with no need to show when the objects appear. In this work, we introduce a new setting towards in‑the‑wild RVOS. To this end, we collect a new benchmark dataset using Youtube Untrimmed videos for RVOS, YoURVOS, which contains 1,120 in‑the‑wild videos with 7 times more duration and scenes than existing datasets. Our new benchmark challenges RVOS methods to show not only where but also when objects appear in videos. To set a baseline, we propose Object‑level Multimodal TransFormers (OMFormer) to tackle the challenges, which are characterized by encoding object‑level multimodal interactions for efficient and global spatial‑temporal localisation. We demonstrate that previous VOS methods struggle on our YoURVOS benchmark, especially with the increase of target‑absent frames, while our OMFormer consistently performs well. Our YoURVOS dataset offers an imperative benchmark, which will push forward the advancement of RVOS methods for practical applications.

Abstract:
Vision foundation models trained with self‑supervised objectives achieve strong performance across diverse tasks and exhibit emergent object segmentation properties. However, their alignment with human object perception remains poorly understood. Here, we introduce a behavioral benchmark in which participants make same/different object judgments for dot pairs on naturalistic scenes, scaling up a classical psychophysics paradigm to over 1000 trials. We test a diverse set of vision models using a simple readout from their representations to predict subjects' reaction times. We observe a steady improvement across model generations, with both architecture and training objective contributing to alignment, and transformer‑based models trained with the DINO self‑supervised objective showing the strongest performance. To investigate the source of this improvement, we propose a novel metric to quantify the object‑centric component of representations by measuring patch similarity within and between objects. Across models, stronger object‑centric structure predicts human segmentation behavior more accurately. We further show that matching the Gram matrix of supervised transformer models, capturing similarity structure across image patches, with that of a self‑supervised model through distillation improves their alignment with human behavior, converging with the prior finding that Gram anchoring improves DINOv3's feature quality. Together, these results demonstrate that self‑supervised vision models capture object structure in a behaviorally human‑like manner, and that Gram matrix structure plays a role in driving perceptual alignment.
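
To make the idea of measuring patch similarity within and between objects concrete, the sketch below computes the mean cosine similarity of patch pairs that share an object label minus the mean similarity of pairs that do not. The abstract does not specify the metric, so this particular gap statistic, the use of cosine similarity, and the toy features are assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def object_centric_gap(features, object_ids):
    """Mean within-object similarity minus mean between-object similarity."""
    within, between = [], []
    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            sim = cosine(features[i], features[j])
            (within if object_ids[i] == object_ids[j] else between).append(sim)
    if not within or not between:
        raise ValueError("need at least one within-object and one between-object pair")
    return sum(within) / len(within) - sum(between) / len(between)

# Toy patch features: patches 0-1 belong to one object, patches 2-3 to another.
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
gap = object_centric_gap(feats, [0, 0, 1, 1])
# Labels that cut across the feature clusters yield a negative gap.
mismatched = object_centric_gap(feats, [0, 1, 0, 1])
```

A larger positive gap means the representation groups patches by object more strongly, which is the property the abstract relates to human segmentation behavior.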

Abstract:
Underwater salient instance segmentation (USIS) is crucial for marine robotic systems, as it enables both underwater salient object detection and instance‑level mask prediction for visual scene understanding. Compared with its terrestrial counterpart, USIS is more challenging due to the underwater image degradation. To address this issue, this paper proposes USIS‑PGM, a single‑stage framework for USIS. Specifically, the encoder enhances boundary cues through a frequency‑aware module and performs content‑adaptive feature reweighting via a dynamic weighting module. The decoder incorporates a Transformer‑based instance activation module to better distinguish salient instances. In addition, USIS‑PGM employs multi‑scale Gaussian heatmaps generated from ground‑truth masks through Photometric Gaussian Mixture (PGM) to supervise intermediate decoder features, thereby improving salient instance localization and producing more structurally coherent mask predictions. Experimental results demonstrate the superiority and practical applicability of the proposed USIS‑PGM model.

Abstract:
Recent years have witnessed remarkable development in open‑vocabulary semantic segmentation (OVSS) using visual‑language foundation models, yet existing methods still suffer from the following fundamental challenges: (1) insufficient cross‑modal communication between textual and visual spaces, and (2) significant computational costs from interactions with a massive number of categories. To address these issues, this paper describes a novel coarse‑to‑fine framework, called DCP‑CLIP, for OVSS. Unlike prior efforts that mainly relied on pre‑established category content and the inherent spatial‑class interaction capability of CLIP, we dynamically construct category‑relevant textual features and explicitly model dual interactions between spatial image features and textual class semantics. Specifically, we first leverage CLIP's open‑vocabulary recognition capability to identify semantic categories relevant to the image context, upon which we dynamically generate corresponding textual features to serve as initial textual guidance. Subsequently, we conduct a coarse segmentation by cross‑modally integrating semantic information from textual guidance into the visual representations and achieve refined segmentation by integrating spatially enriched features from the encoder to recover fine‑grained details and enhance spatial resolution. Finally, we leverage spatial information from the segmentation side to refine category predictions for each mask, facilitating more precise semantic labeling. Experiments on multiple OVSS benchmarks demonstrate that DCP‑CLIP outperforms existing methods by delivering both higher accuracy and greater efficiency.

Abstract:
Humans develop visual intelligence through perceiving and interacting with their environment ‑ a self‑supervised learning process grounded in egocentric experience. Inspired by this, we ask how artificial systems can learn stable object representations from continuous, uncurated first‑person videos without relying on manual annotations. This setting poses challenges of separating, recognizing, and persistently tracking objects amid clutter, occlusion, and ego‑motion. We propose EgoViT, a unified vision Transformer framework designed to learn stable object representations from unlabeled egocentric video. EgoViT bootstraps this learning process by jointly discovering and stabilizing "proto‑objects" through three synergistic mechanisms: (1) Proto‑object Learning, which uses intra‑frame distillation to form discriminative representations; (2) Depth Regularization, which grounds these representations in geometric structure; and (3) Teacher‑Filtered Temporal Consistency, which enforces identity over time. This creates a virtuous cycle where initial object hypotheses are progressively refined into stable, persistent representations. The framework is trained end‑to‑end on unlabeled first‑person videos and exhibits robustness to geometric priors of varied origin and quality. On standard benchmarks, EgoViT achieves +8.0% CorLoc improvement in unsupervised object discovery and +4.8% mIoU improvement in semantic segmentation, demonstrating its potential to lay a foundation for robust visual abstraction in embodied intelligence.

Abstract:
Continual semantic segmentation (CSS) is a cornerstone task in computer vision that enables a large number of downstream applications, but faces the catastrophic forgetting challenge. In conventional class‑incremental semantic segmentation (CISS) frameworks using Softmax‑based classification heads, catastrophic forgetting originates from Catastrophic forgetting and task affiliation probability. We formulate these problems and provide a theoretical analysis to more deeply understand the limitations in existing CISS methods, particularly Strict Parameter Isolation (SPI). To address these challenges, we follow a dual‑phase intuition from human annotators, and introduce Cognitive Cascade Segmentation (CogCaS), a novel dual‑phase cascade formulation for CSS tasks in the CISS setting. By decoupling the task into class‑existence detection and class‑specific segmentation, CogCaS enables more effective continual learning, preserving previously learned knowledge while incorporating new classes. Using two benchmark datasets, PASCAL VOC 2012 and ADE20K, we show significant improvements in a variety of challenging scenarios, particularly those with long sequences of incremental tasks, when compared to existing state‑of‑the‑art methods. Our code will be made publicly available upon paper acceptance.

Abstract:
Feature matching is a fundamental problem in computer vision with wide‑ranging applications, including simultaneous localization and mapping (SLAM), image stitching, and 3D reconstruction. While recent advances in deep learning have improved keypoint detection and description, most approaches focus primarily on geometric attributes and often neglect higher‑level semantic information. This work proposes a semantic‑aware feature extraction framework that employs multi‑task learning to jointly train keypoint detection, keypoint description, and semantic segmentation. The method is benchmarked against standard feature matching techniques and evaluated in the context of 3D reconstruction. To enhance feature correspondence, a deep matching module is integrated. The system is tested using input from a single monocular fisheye camera mounted on a vehicle and evaluated within a multi‑floor parking structure. The proposed approach supports semantic 3D reconstruction with altitude estimation, capturing elevation changes and enabling multi‑level mapping. Experimental results demonstrate that the method produces semantically annotated 3D point clouds with improved structural detail and elevation information, underscoring the effectiveness of joint training with semantic cues for more consistent feature matching and enhanced 3D reconstruction.

Abstract:
In this thesis, we leverage monocular cameras on aerial robots to predict depth and semantic maps in low‑altitude unstructured environments. We propose a joint deep‑learning architecture, named Co‑SemDepth, that can perform the two tasks accurately and rapidly, and validate its effectiveness on a variety of datasets. The training of neural networks requires an abundance of annotated data, and in the UAV field, the availability of such data is limited. We introduce a new synthetic dataset in this thesis, TopAir, which contains images captured with a nadir view in outdoor environments at different altitudes, helping to fill the gap. While using synthetic data for training is convenient, it raises issues when shifting to the real domain for testing. We conduct an extensive analytical study to assess the effect of several factors on synthetic‑to‑real generalization. Co‑SemDepth and TaskPrompter models are used for comparison in this study. The results reveal a superior generalization performance for Co‑SemDepth in depth estimation and for TaskPrompter in semantic segmentation. Also, our analysis allows us to determine which training datasets lead to better generalization. Moreover, to help attenuate the gap between the synthetic and real domains, image style transfer techniques are explored on aerial images to convert from the synthetic to the realistic style. Cycle‑GAN and Diffusion models are employed. The results reveal that diffusion models are better at synthetic‑to‑real style transfer. In the end, we focus on the marine domain and address its challenges. Co‑SemDepth is trained on a collected synthetic marine dataset, called MidSea, and tested on both synthetic and real data. The results reveal good generalization performance of Co‑SemDepth when tested on real data from the SMD dataset, while further enhancement is needed on the MIT dataset.

Abstract:
Domain Generalization Semantic Segmentation (DGSS) in spectral remote sensing is severely challenged by spectral shifts across diverse acquisition conditions, which cause significant performance degradation for models deployed in unseen domains. While fine‑tuning foundation models is a promising direction, existing methods employ global, homogeneous adjustments. This "one‑size‑fits‑all" tuning struggles with the spatial heterogeneity of land cover, causing semantic confusion. We argue that the key to robust DGSS lies not in a single global adaptation, but in performing fine‑grained, spatially‑adaptive refinement of a foundation model's features. To achieve this, we propose SpectralMoE, a novel fine‑tuning framework for DGSS. It operationalizes this principle by utilizing a Mixture‑of‑Experts (MoE) architecture to perform local precise refinement on the foundation model's features, incorporating depth features estimated from selected RGB bands of the spectral remote sensing imagery to guide the fine‑tuning process. Specifically, SpectralMoE employs a dual‑gated MoE architecture that independently routes visual and depth features to top‑k selected experts for specialized refinement, enabling modality‑specific adjustments. A subsequent cross‑attention mechanism then judiciously fuses the refined structural cues into the visual stream, mitigating semantic ambiguities caused by spectral variations. Extensive experiments show that SpectralMoE sets a new state‑of‑the‑art on multiple DGSS benchmarks across hyperspectral, multispectral, and RGB remote sensing imagery.
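
For readers unfamiliar with top-k expert routing, the sketch below shows a generic version: a gate scores every expert, only the k highest-scoring experts run, and their outputs are combined with softmax weights renormalized over the selected experts. This is a textbook-style illustration, not SpectralMoE's implementation; the function names, the renormalization choice, and the toy experts are assumptions. A dual-gated design in the sense of the abstract would presumably apply such routing twice, with separate gate logits for the visual and depth features.

```python
import math

def top_k_route(gate_logits, k):
    """Return {expert_index: weight} for the k largest gate logits.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    if not 1 <= k <= len(gate_logits):
        raise ValueError("k must be between 1 and the number of experts")
    chosen = sorted(range(len(gate_logits)),
                    key=lambda i: gate_logits[i], reverse=True)[:k]
    peak = max(gate_logits[i] for i in chosen)  # subtract for numerical stability
    exps = {i: math.exp(gate_logits[i] - peak) for i in chosen}
    norm = sum(exps.values())
    return {i: e / norm for i, e in exps.items()}

def mix_experts(x, experts, gate_logits, k):
    """Weighted sum of the outputs of the k selected experts for feature x."""
    out = [0.0] * len(x)
    for i, w in top_k_route(gate_logits, k).items():
        y = experts[i](x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out

# Three toy experts acting on a 2-dimensional feature vector.
experts = [lambda v: [2 * a for a in v],
           lambda v: [a + 1 for a in v],
           lambda v: [-a for a in v]]
visual_out = mix_experts([1.0, 2.0], experts, gate_logits=[2.0, 2.0, -5.0], k=2)
```

Here the first two experts tie and each receive weight 0.5, while the third expert is never evaluated, which is where the compute saving of sparse routing comes from.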

Abstract:
Large‑scale models are typically adapted to meet the diverse requirements of model owners and users. However, maintaining multiple specialized versions of the model is inefficient. In response, we propose AIM, a novel model modulation paradigm that enables a single model to exhibit diverse behaviors to meet the specific end requirements. AIM enables two key modulation modes: utility and focus modulations. The former provides model owners with dynamic control over output quality to deliver varying utility levels, and the latter offers users precise control to shift the model's focused input features. AIM introduces a logits redistribution strategy that operates in a training data‑agnostic and retraining‑free manner. We establish a formal foundation to ensure AIM's regulation capability, based on the statistical properties of logits ordering via joint probability distributions. Our evaluation confirms AIM's practicality and versatility for AI model modulation, with tasks spanning image classification, semantic segmentation and text generation, and prevalent architectures including ResNet, SegFormer and Llama.

Abstract:
Synthetic Aperture Radar (SAR) enables global, all‑weather earth observation. However, owing to diverse imaging mechanisms, domain shifts across sensors and regions severely hinder its semantic generalization. To address this, we present CrossEarth‑SAR, the first billion‑scale SAR vision foundation model built upon a novel physics‑guided sparse mixture‑of‑experts (MoE) architecture incorporating physical descriptors, explicitly designed for cross‑domain semantic segmentation. To facilitate large‑scale pre‑training, we develop CrossEarth‑SAR‑200K, a weakly and fully supervised dataset that unifies public and private SAR imagery. We also introduce a benchmark suite comprising 22 sub‑benchmarks across 8 distinct domain gaps, establishing the first unified standard for domain generalization semantic segmentation on SAR imagery. Extensive experiments demonstrate that CrossEarth‑SAR achieves state‑of‑the‑art results on 20 benchmarks, surpassing previous methods by over 10% mIoU on some benchmarks under multi‑gap transfer. All code, benchmark and datasets will be publicly available.
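
Since several abstracts in this list report mIoU, a minimal sketch of the metric may help: per-class intersection-over-union is computed from predicted and ground-truth labels and then averaged over classes. The version below works on flat label lists for a single image and skips classes absent from both; benchmark protocols typically accumulate counts over the whole dataset and may treat ignore labels differently, so the details here are simplifying assumptions.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have equal length")
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes that appear in neither map
            ious.append(inter / union)
    if not ious:
        raise ValueError("no class appears in pred or target")
    return sum(ious) / len(ious)

pred   = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
# Per-class IoU is 1/3, 2/3 and 1/2, so the mean is 0.5.
score = mean_iou(pred, target, num_classes=3)
```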

Abstract:
As Extended Reality (XR) systems increasingly map and understand the physical world, interacting with these blended representations remains challenging. The current push for "natural" inputs has its trade‑offs: touch is limited by human reach and fatigue, while gaze often lacks the precision for fine interaction. To bridge this gap, we introduce World Mouse, a cross‑reality cursor that reinterprets the familiar 2D desktop mouse for complex 3D scenes. The system is driven by two core mechanisms: within‑object interaction, which uses surface normals for precise cursor placement, and between‑object navigation, which leverages interpolation to traverse empty space. Unlike previous virtual‑only approaches, World Mouse leverages semantic segmentation and mesh reconstruction to treat physical objects as interactive surfaces. Through a series of prototypes, including object manipulation and screen‑to‑world transitions, we illustrate how cross‑reality cursors may enable seamless interactions across real and virtual environments.

Abstract:
Medication errors and adverse drug events (ADEs) pose significant risks to patient safety, often arising from difficulties in reliably identifying pharmaceuticals in real‑world settings. AI‑based pill recognition models offer a promising solution, but the lack of comprehensive datasets hinders their development. Existing pill image datasets rarely capture real‑world complexities such as overlapping pills, varied lighting, and occlusions. MEDISEG addresses this gap by providing instance segmentation annotations for 32 distinct pill types across 8262 images, encompassing diverse conditions from individual pill images to cluttered dosette boxes. We trained YOLOv8 and YOLOv9 on MEDISEG to demonstrate their usability, achieving mean average precision at IoU 0.5 of 99.5 percent on the 3‑Pills subset and 80.1 percent on the 32‑Pills subset. We further evaluate MEDISEG under a few‑shot detection protocol, demonstrating that base training on MEDISEG significantly improves recognition of unseen pill classes in occluded multi‑pill scenarios compared to existing datasets. These results highlight the dataset's ability not only to support robust supervised training but also to promote transferable representations under limited supervision, making it a valuable resource for developing and benchmarking AI‑driven systems for medication safety.

Abstract:
Reliable visual monitoring of chemical experiments remains challenging in transparent glassware, where weak phase boundaries and optical artifacts degrade conventional segmentation. We formulate laboratory phenomena as the time evolution of phase interfaces and introduce the Chemical Transparent Glasses dataset 2.0 (CTG 2.0), a vessel‑aware benchmark with 3,668 images, 23 glassware categories, and five multiphase interface types for phase‑interface instance segmentation. Building on YOLO11m‑seg, we propose LGA‑RCM‑YOLO, which combines Local‑Global Attention (LGA) for robust semantic representation and a Rectangular Self‑Calibration Module (RCM) for boundary refinement of thin, elongated interfaces. On CTG 2.0, the proposed model achieves 84.4% AP@0.5 and 58.43% AP@0.5‑0.95, improving over the YOLO11m baseline by 6.42 and 8.75 AP points, respectively, while maintaining near real‑time inference (13.67 FPS, RTX 3060). An auxiliary color‑attribute head further labels liquid instances as colored or colorless with 98.71% precision and 98.32% recall. Finally, we demonstrate continuous process monitoring in separatory‑funnel phase separation and crystallization, showing that phase‑interface instance segmentation can serve as a practical visual sensor for laboratory automation.

Abstract:
Self‑supervised visual pre‑training methods face an inherent tension: contrastive learning (CL) captures global semantics but loses fine‑grained detail, while masked image modeling (MIM) preserves local textures but suffers from "attention drift" due to semantically‑agnostic random masking. We propose C2FMAE, a coarse‑to‑fine masked autoencoder that resolves this tension by explicitly learning hierarchical visual representations across three data granularities: semantic masks (scene‑level), instance masks (object‑level), and RGB images (pixel‑level). Two synergistic innovations enforce a strict top‑down learning principle. First, a cascaded decoder sequentially reconstructs from scene semantics to object instances to pixel details, establishing explicit cross‑granularity dependencies that parallel decoders cannot capture. Second, a progressive masking curriculum dynamically shifts the training focus from semantic‑guided to instance‑guided and finally to random masking, creating a structured learning path from global context to local features. To support this framework, we construct a large‑scale multi‑granular dataset with high‑quality pseudo‑labels for all 1.28M ImageNet‑1K images. Extensive experiments show that C2FMAE achieves significant performance gains on image classification, object detection, and semantic segmentation, validating the effectiveness of our hierarchical design in learning more robust and generalizable representations.

Abstract:
Achieving robust spatial reasoning remains a fundamental challenge for current Multimodal Foundation Models (MFMs). Existing methods either overfit statistical shortcuts via 3D grounding data or remain confined to 2D visual perception, limiting both spatial reasoning accuracy and generalization in unseen scenarios. Inspired by the spatial cognitive mapping mechanisms of biological intelligence, we propose World2Mind, a training‑free spatial intelligence toolkit. At its core, World2Mind leverages 3D reconstruction and instance segmentation models to construct structured spatial cognitive maps, empowering MFMs to proactively acquire targeted spatial knowledge regarding landmarks and routes of interest. To provide robust geometric‑topological priors, World2Mind synthesizes an Allocentric‑Spatial Tree (AST) that uses elliptical parameters to model the top‑down layout of landmarks accurately. To mitigate the inherent inaccuracies of 3D reconstruction, we introduce a three‑stage reasoning chain comprising tool invocation assessment, modality‑decoupled cue collection, and geometry‑semantics interwoven reasoning. Extensive experiments demonstrate that World2Mind boosts the performance of frontier models, such as GPT‑5.2, by 5% to 18%. Astonishingly, relying solely on the AST‑structured text, purely text‑only foundation models can perform complex 3D spatial reasoning, achieving performance approaching that of advanced multimodal models.

Abstract:
Deep learning models benefit from increasing data diversity and volume, motivating synthetic data augmentation to improve existing datasets. However, existing evaluation metrics for synthetic data typically calculate latent feature similarity, which is difficult to interpret and does not always correlate with the contribution to downstream tasks. We propose a vision‑language grounded framework for interpretable synthetic data augmentation and evaluation in remote sensing. Our approach combines generative models, semantic segmentation and image captioning with vision and language models. Based on this framework, we introduce ARAS400k: A large‑scale Remote sensing dataset Augmented with Synthetic data for segmentation and captioning, containing 100k real images and 300k synthetic images, each paired with segmentation maps and descriptions. ARAS400k enables the automated evaluation of synthetic data by analyzing semantic composition, minimizing caption redundancy, and verifying cross‑modal consistency between visual structures and language descriptions. Experimental results indicate that while models trained exclusively on synthetic data reach competitive performance levels, those trained with augmented data (a combination of real and synthetic images) consistently outperform real‑data baselines. Consequently, this work establishes a scalable benchmark for remote sensing tasks, specifically in semantic segmentation and image captioning. The dataset is available at zenodo.org/records/18890661 and the code base at github.com/caglarmert/ARAS400k.

Abstract:
One of the bottlenecks for instance segmentation today lies in the conflicting requirements of high‑resolution inputs and lightweight, real‑time inference. To address this bottleneck, we present a Polygon Detection Transformer (Poly‑DETR) to reformulate instance segmentation as sparse vertex regression via Polar Representation, thereby eliminating the reliance on dense pixel‑wise mask prediction. Considering the box‑to‑polygon reference shift in Detection Transformers, we propose Polar Deformable Attention and a Position‑Aware Training Scheme to dynamically update supervision and focus attention on boundary cues. Compared with state‑of‑the‑art polar‑based methods, Poly‑DETR achieves a 4.7 mAP improvement on MS COCO test‑dev. Moreover, we construct a parallel mask‑based counterpart to support a systematic comparison between polar and mask representations. Experimental results show that Poly‑DETR is more lightweight in high‑resolution scenarios, reducing memory consumption by almost half on the Cityscapes dataset. Notably, on the PanNuke (cell segmentation) and SpaceNet (building footprints) datasets, Poly‑DETR surpasses its mask‑based counterpart on all metrics, which validates its advantage on regular‑shaped instances in domain‑specific settings.
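
To make the idea of a polar representation concrete, the sketch below decodes a center point and a list of radii into polygon vertices, so that an instance is described by a handful of numbers instead of a dense mask. Evenly spaced ray angles and the function names are assumptions for illustration; the abstract does not specify how Poly‑DETR parameterizes its vertices.

```python
import math


def polar_to_polygon(center, radii):
    """Turn a center and K radii (one per evenly spaced angle) into K (x, y) vertices."""
    cx, cy = center
    k = len(radii)
    return [
        (cx + r * math.cos(2 * math.pi * i / k),
         cy + r * math.sin(2 * math.pi * i / k))
        for i, r in enumerate(radii)
    ]


def polygon_area(vertices):
    """Area of a simple polygon via the shoelace formula."""
    n = len(vertices)
    twice_area = 0.0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


# Four unit-length rays from the origin give a diamond with vertices on the axes.
diamond = polar_to_polygon((0.0, 0.0), [1.0, 1.0, 1.0, 1.0])
area = polygon_area(diamond)
```

With K rays, the regression target has K radii plus a center, which is the sense in which vertex regression is "sparse" compared with predicting one value per pixel.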

Abstract:
This study proposes an enhanced dual‑model YOLOv8 framework for intelligent fire detection and proximity‑aware risk assessment, extending conventional vision‑based monitoring beyond simple detection to actionable hazard prioritization. The system is trained on a dataset of 9,860 annotated images to segment fire and smoke across complex environments. The framework combines a primary YOLOv8 instance segmentation model for fire and smoke detection with a secondary object detection model pretrained on the COCO dataset to identify surrounding entities such as people, vehicles, and infrastructure. By integrating the outputs of both models, the system computes pixel‑based distances between detected fire regions and nearby objects and converts these values into approximate real‑world measurements using a pixel‑to‑meter scaling approach. This proximity information is incorporated into a risk assessment mechanism that combines fire evidence, object vulnerability, and distance‑based exposure to produce a quantitative risk score and alert level. The proposed framework achieves strong performance, with precision, recall, and F1 scores exceeding 90% and mAP@0.5 above 91%. The system generates annotated visual outputs showing fire locations, detected objects, estimated distances, and contextual risk information to support situational awareness. Implemented using open‑source tools within the Google Colab environment, the framework is lightweight and suitable for deployment in industrial and resource‑constrained settings.
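
The proximity and risk steps described above can be sketched as follows. The abstract names the ingredients (pixel distance, pixel‑to‑meter scaling, fire evidence, object vulnerability, distance‑based exposure) but not the formulas, so the center‑to‑center distance, the linear exposure term, the multiplicative score, and every constant below are hypothetical choices made only for illustration.

```python
import math

METERS_PER_PIXEL = 0.05  # hypothetical scale factor; depends on camera setup


def box_center(box):
    """Center of an (x1, y1, x2, y2) bounding box in pixels."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def pixel_distance(box_a, box_b):
    """Euclidean distance in pixels between two box centers."""
    (ax, ay), (bx, by) = box_center(box_a), box_center(box_b)
    return math.hypot(ax - bx, ay - by)


def risk_score(fire_conf, vulnerability, distance_m, safe_distance_m=10.0):
    """Hypothetical score: fire evidence x vulnerability x linear exposure in [0, 1]."""
    exposure = max(0.0, 1.0 - distance_m / safe_distance_m)
    return fire_conf * vulnerability * exposure


fire_box = (100, 100, 200, 200)    # from the segmentation model (illustrative)
person_box = (230, 100, 270, 200)  # from the COCO-pretrained detector (illustrative)

d_px = pixel_distance(fire_box, person_box)
d_m = d_px * METERS_PER_PIXEL
score = risk_score(fire_conf=0.9, vulnerability=1.0, distance_m=d_m)
```

A single global scale factor is only a rough approximation, since the true pixel‑to‑meter ratio varies with depth and perspective across the image.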

Abstract:
Background and objectives: Colorectal cancer histopathological grading depends on accurate segmentation of glandular structures. Current deep learning approaches rely on large scale pixel level annotations that are labor intensive and difficult to obtain in routine clinical practice. Weakly supervised semantic segmentation offers a promising alternative. However, class activation map based methods often produce incomplete pseudo masks that emphasize highly discriminative regions and fail to supervise unannotated glandular structures. We propose a weakly supervised teacher student framework that leverages sparse pathologist annotations and an Exponential Moving Average stabilized teacher network to generate refined pseudo masks. Methods: The framework integrates confidence based filtering, adaptive fusion of teacher predictions with limited ground truth, and curriculum guided refinement to progressively segment unannotated glandular regions. The method was evaluated on an institutional colorectal cancer cohort from The Ohio State University Wexner Medical Center consisting of 60 hematoxylin and eosin stained whole slide images and on public datasets including the Gland Segmentation dataset, TCGA COAD, TCGA READ, and SPIDER. Results: On the Gland Segmentation dataset the framework achieved a mean Intersection over Union of 80.10 and a mean Dice coefficient of 89.10. Cross cohort evaluation demonstrated robust generalization on TCGA COAD and TCGA READ without additional annotations, while reduced performance on SPIDER reflected domain shift. Conclusions: The proposed framework provides an annotation efficient and generalizable approach for gland segmentation in colorectal histopathology.
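
The teacher‑student mechanics mentioned in the Methods can be illustrated with a minimal sketch: an exponential moving average (EMA) update of teacher weights, followed by a confidence‑filtered fusion of teacher predictions with sparse labels. The decay value, the threshold, the `IGNORE` marker, and the rule that annotated pixels always win are assumptions for this example, not the settings reported by the authors.

```python
IGNORE = -1  # hypothetical marker for pixels with no label and no confident prediction


def ema_update(teacher, student, decay=0.99):
    """EMA update: teacher <- decay * teacher + (1 - decay) * student, per parameter."""
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


def fuse_pseudo_mask(teacher_probs, sparse_labels, threshold=0.8):
    """Keep annotated pixels; elsewhere accept only confident teacher predictions."""
    fused = []
    for prob, label in zip(teacher_probs, sparse_labels):
        if label != IGNORE:
            fused.append(label)          # trust the pathologist annotation
        elif prob >= threshold:
            fused.append(1)              # confident gland
        elif prob <= 1.0 - threshold:
            fused.append(0)              # confident background
        else:
            fused.append(IGNORE)         # too uncertain to supervise
    return fused


teacher = ema_update({"w": 1.0}, {"w": 0.0}, decay=0.9)
pseudo = fuse_pseudo_mask([0.95, 0.5, 0.1, 0.3], [IGNORE, IGNORE, IGNORE, 1])
```

Because the teacher changes slowly, its pseudo masks are less sensitive to noisy student updates, which is the usual motivation for EMA stabilization.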

Abstract:
Robust grasping in cluttered, unstructured environments remains challenging for mobile legged manipulators due to occlusions that lead to partial observations, unreliable depth estimates, and the need for collision‑free, execution‑feasible approaches. In this paper we present an end‑to‑end pipeline for language‑guided grasping that bridges open‑vocabulary target selection to safe grasp execution on a real robot. Given a natural‑language command, the system grounds the target in RGB using open‑vocabulary detection and promptable instance segmentation, extracts an object‑centric point cloud from RGB‑D, and improves geometric reliability under occlusion via back‑projected depth compensation and two‑stage point cloud completion. We then generate and collision‑filter 6‑DoF grasp candidates and select an executable grasp using safety‑oriented heuristics that account for reachability, approach feasibility, and clearance. We evaluate the method on a quadruped robot with an arm in two cluttered tabletop scenarios, using paired trials against a view‑dependent baseline. The proposed approach achieves a 90% overall success rate (9/10) against 30% (3/10) for the baseline, demonstrating substantially improved robustness to occlusions and partial observations in clutter.

Abstract:
Scene understanding plays a critical role in enabling intelligence and autonomy in robotic systems. Traditional approaches often face challenges, including occlusions, ambiguous boundaries, and the inability to adapt attention based on task‑specific requirements and sample variations. To address these limitations, this paper presents an efficient RGB‑D scene understanding model that performs a range of tasks, including semantic segmentation, instance segmentation, orientation estimation, panoptic segmentation, and scene classification. The proposed model incorporates an enhanced fusion encoder, which effectively leverages redundant information from both RGB and depth inputs. For semantic segmentation, we introduce normalized focus channel layers and a context feature interaction layer, designed to mitigate issues such as shallow feature misguidance and insufficient local‑global feature representation. The instance segmentation task benefits from a non‑bottleneck 1D structure, which achieves superior contour representation with fewer parameters. Additionally, we propose a multi‑task adaptive loss function that dynamically adjusts the learning strategy for different tasks based on scene variations. Extensive experiments on the NYUv2, SUN RGB‑D, and Cityscapes datasets demonstrate that our approach outperforms existing methods in both segmentation accuracy and processing speed.

Abstract:
Local material inhomogeneities can strongly influence magnetization dynamics and macroscopic magnetic properties, yet detecting such defects from magnetic imaging data remains challenging when thermal fluctuations and experimental noise obscure static contrast. Here, we investigate defect detection in strongly fluctuating magnetization regimes where signatures of inhomogeneities largely average out in time‑resolved measurements. Using finite‑temperature micromagnetic simulations with randomly distributed defects and material parameters representative of Ni80Fe20, we compute per‑pixel temporal mean, temporal standard deviation, and latent entropy and use them as inputs for U‑Net‑based semantic segmentation models. We find that the most effective descriptor depends on the noise level and, importantly, that robust detection requires training data that reflect the expected noise statistics. These results provide practical guidance for designing noise‑robust defect‑detection workflows in magnetic imaging.
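
Two of the per‑pixel descriptors named above, the temporal mean and temporal standard deviation, can be computed as in the sketch below. The nested‑list frame layout and the use of the population standard deviation are assumptions for this example; the latent entropy descriptor is not defined in the abstract and is therefore omitted.

```python
from statistics import fmean, pstdev


def temporal_descriptors(frames):
    """Per-pixel temporal mean and standard deviation.

    frames: list of T frames, each a list of rows of floats (same shape).
    Returns (mean_map, std_map) with the shape of a single frame.
    """
    height, width = len(frames[0]), len(frames[0][0])
    mean_map = [[0.0] * width for _ in range(height)]
    std_map = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            series = [frame[i][j] for frame in frames]  # time series of one pixel
            mean_map[i][j] = fmean(series)
            std_map[i][j] = pstdev(series)  # population standard deviation
    return mean_map, std_map


# Two 1x2 frames: the first pixel flips sign over time, the second stays at zero.
frames = [[[1.0, 0.0]], [[-1.0, 0.0]]]
mean_map, std_map = temporal_descriptors(frames)
```

The toy input shows why the mean alone can miss a fluctuating pixel: both pixels average to zero, and only the standard deviation separates them.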

Abstract:
In this paper, we identify low‑dimensional models for dense core subsets in the space of 3×3 high‑contrast optical flow patches sampled from the Sintel dataset. In particular, we leverage the theory of approximate and discrete circle bundles to identify a 3‑manifold whose boundary is a previously proposed optical flow torus, together with disjoint circles corresponding to pairs of binary step‑edge range image patches. The 3‑manifold model we introduce provides an explanation for why the previously‑proposed torus model could not be verified with direct methods (e.g., a straightforward persistent homology computation). We also demonstrate that nearly all optical flow patches in the top 1 percent by contrast norm are found near the family of binary step‑edge circles described above, rather than the optical flow torus, and that these frequently occurring patches are concentrated near motion boundaries (which are of particular importance for computer vision tasks such as object segmentation and tracking). Our findings offer insights on the subtle interplay between topology and geometry in inference for visual data.

Abstract:
Rigid bodies constitute the smallest manipulable elements in the real world, and understanding how they physically interact is fundamental to embodied reasoning and robotic manipulation. Thus, accurate detection, segmentation, and tracking of moving rigid bodies is essential for enabling reasoning modules to interpret and act in diverse environments. However, current segmentation models trained on semantic grouping are limited in their ability to provide meaningful interaction‑level cues for completing embodied tasks. To address this gap, we introduce MotionBit, a novel concept that, unlike prior formulations, defines the smallest unit in motion‑based segmentation through kinematic spatial twist equivalence, independent of semantics. In this paper, we contribute (1) the MotionBit concept and definition, (2) a hand‑labeled benchmark, called MoRiBo, for evaluating moving rigid‑body segmentation across robotic manipulation and human‑in‑the‑wild videos, and (3) a learning‑free graph‑based MotionBits segmentation method that outperforms state‑of‑the‑art embodied perception methods by 37.3% in macro‑averaged mIoU on the MoRiBo benchmark. Finally, we demonstrate the effectiveness of MotionBits segmentation for downstream embodied reasoning and manipulation tasks, highlighting its importance as a fundamental primitive for understanding physical interactions.

Abstract:
We present Rewis3d, a framework that leverages recent advances in feed‑forward 3D reconstruction to significantly improve weakly supervised semantic segmentation on 2D images. Obtaining dense, pixel‑level annotations remains a costly bottleneck for training segmentation models. Alleviating this issue, sparse annotations offer an efficient weakly‑supervised alternative. However, they still incur a performance gap. To address this, we introduce a novel approach that leverages 3D scene reconstruction as an auxiliary supervisory signal. Our key insight is that 3D geometric structure recovered from 2D videos provides strong cues that can propagate sparse annotations across entire scenes. Specifically, a dual student‑teacher architecture enforces semantic consistency between 2D images and reconstructed 3D point clouds, using state‑of‑the‑art feed‑forward reconstruction to generate reliable geometric supervision. Extensive experiments demonstrate that Rewis3d achieves state‑of‑the‑art performance in sparse supervision, outperforming existing approaches by 2‑7% without requiring additional labels or inference overhead.

Abstract:
Current semantic segmentation approaches for point cloud scenes heavily rely on manual labeling, while research on unsupervised semantic segmentation methods specifically for raw point clouds is still in its early stages. Unsupervised point cloud learning poses significant challenges due to the absence of annotation information and the lack of pre‑training. The development of effective strategies is crucial in this context. In this paper, we propose a novel prototype library‑driven unsupervised point cloud semantic segmentation strategy that utilizes Structure Learning and Consistent Reasoning (P‑SLCR). First, we propose Consistent Structure Learning to establish structural feature learning between consistent points and the library of consistent prototypes by selecting high‑quality features. Second, we propose Semantic Relation Consistent Reasoning, which constructs a prototype inter‑relation matrix for the consistent and ambiguous prototype libraries separately. This process preserves semantic consistency by imposing constraints on the consistent and ambiguous prototype libraries through the prototype inter‑relation matrix. Finally, our method was extensively evaluated on the S3DIS, SemanticKITTI, and ScanNet datasets, achieving the best performance among unsupervised methods. Specifically, it achieves an mIoU of 47.1% on Area‑5 of the S3DIS dataset, surpassing the classical fully supervised method PointNet by 2.5%.

Abstract:
Semantic segmentation across visual modalities such as 3D point clouds and panoramic images remains a challenging task, primarily due to the scarcity of annotated data and the limited adaptability of fixed‑label models. In this paper, we present JOPP‑3D, an open‑vocabulary semantic segmentation framework that jointly leverages panoramic and point cloud data to enable language‑driven scene understanding. We convert RGB‑D panoramic images into their corresponding tangential perspective images and 3D point clouds, then use these modalities to extract and align foundational vision‑language features. This allows natural language querying to generate semantic masks on both input modalities. Experimental evaluation on the Stanford‑2D‑3D‑s and ToF‑360 datasets demonstrates the capability of JOPP‑3D to produce coherent and semantically meaningful segmentations across panoramic and 3D domains. Our proposed method achieves a significant improvement compared to the SOTA in open and closed vocabulary 2D and 3D semantic segmentation.

Abstract:
To address the limitations of existing open‑vocabulary object recognition methods, specifically high system complexity, substantial training costs, and limited generalization, this paper proposes a novel Open‑Vocabulary Object Recognition (OVOR) framework based on a streamlined two‑stage strategy: object segmentation followed by recognition. The framework eliminates the need for complex retraining and labor‑intensive annotation. After cropping object regions, we generate object‑level image embeddings alongside category‑level text embeddings using CLIP, which facilitates arbitrary vocabularies. To reduce reliance on CLIP and enhance encoding flexibility, we further introduce a CNN/MLP‑based method that extracts convolutional neural network (CNN) feature maps and utilizes a multilayer perceptron (MLP) to align visual features with text embeddings. These embeddings are concatenated and processed via Singular Value Decomposition (SVD) to construct a shared representation space. Finally, recognition is performed through embedding similarity matching. Experiments on COCO, Pascal VOC, and ADE20K demonstrate that training‑free, CLIP‑based encoding without SVD achieves the highest average AP, outperforming current state‑of‑the‑art methods. Simultaneously, the results highlight the potential of CNN/MLP‑based image encoding for OVOR.

Abstract:
Traditional safety‑critical control methods, such as control barrier functions, suffer from semantic blindness, exhibiting the same behavior around obstacles regardless of contextual significance. This limitation leads to the uniform treatment of all obstacles, despite their differing semantic meanings. We present Safe‑SAGE (Social‑Semantic Adaptive Guidance for Safe Engagement), a unified framework that bridges the gap between high‑level semantic understanding and low‑level safety‑critical control through a Poisson safety function (PSF) modulated using a Laplace guidance field. Our approach perceives the environment by fusing multi‑sensor point clouds with vision‑based instance segmentation and persistent object tracking to maintain up‑to‑date semantics beyond the camera's field of view. A multi‑layer safety filter is then used to modulate system inputs to achieve safe navigation using this semantic understanding of the environment. This safety filter consists of both a model predictive control layer and a control barrier function layer. Both layers utilize the PSF and flux modulation of the guidance field to introduce varying levels of conservatism and multi‑agent passing norms for different obstacles in the environment. Our framework enables legged robots to safely navigate semantically rich, dynamic environments with context‑dependent safety margins.

Abstract:
Reliable long‑term deployment of autonomous robots in agricultural environments remains challenging due to perceptual aliasing, seasonal variability, and the dynamic nature of crop canopies. Vineyards, characterized by repetitive row structures and significant visual changes across phenological stages, represent a pivotal field challenge, limiting the robustness of conventional feature‑based localization and mapping approaches. This paper introduces VinePT‑Map, a semantic mapping framework that leverages vine trunks and support poles as persistent structural landmarks to enable season‑agnostic and resilient robot localization. The proposed method formulates the mapping problem as a factor graph, integrating GPS, IMU, and RGB‑D observations through robust geometrical constraints that exploit vineyard structure. An efficient perception pipeline based on instance segmentation and tracking, combined with a clustering filter for outlier rejection and pose refinement, enables accurate landmark detection using low‑cost sensors and onboard computation. To validate the pipeline, we present a multi‑season dataset for trunk and pole segmentation and tracking. Extensive field experiments conducted across diverse seasons demonstrate the robustness and accuracy of the proposed approach, highlighting its suitability for long‑term autonomous operation in agricultural environments.

Abstract:
Historical map collections are highly diverse in style, scale, and geographic focus, often consisting of many single‑sheet documents. Yet most work in map recognition focuses on specialist models tailored to homogeneous map series. In contrast, this article aims to develop generalizable semantic segmentation models and ontology. First, we introduce Semap, a new open benchmark dataset comprising 1,439 manually annotated patches designed to reflect the variety of historical map documents. Second, we present a segmentation framework that combines procedural data synthesis with multiscale integration to improve robustness and transferability. This framework achieves state‑of‑the‑art performance on both the HCMSSD and Semap datasets, showing that a diversity‑driven approach to map recognition is not only viable but also beneficial. The results indicate that segmentation performance remains largely stable across map collections, scales, geographic regions, and publication contexts. By proposing benchmark datasets and methods for the generic segmentation of historical maps, this work opens the way to integrating the long tail of cartographic archives to historical geographic studies.

Abstract:
Distribution shifts between training and testing data are a critical bottleneck limiting the practical utility of models, especially in real‑world test‑time scenarios. To adapt models when the source domain is unknown and the target domain is unlabeled, previous works constructed pseudo‑source domains via data generation and translation, then aligned the target domain with them. However, significant discrepancies exist between the pseudo‑source and the original source domain, leading to potential divergence when correcting the target directly. From this perspective, we propose a Stepwise Semantic Alignment (SSA) method, viewing the pseudo‑source as a semantic bridge connecting the source and target, rather than a direct substitute for the source. Specifically, we leverage easily accessible universal semantics to rectify the semantic features of the pseudo‑source, and then align the target domain using the corrected pseudo‑source semantics. Additionally, we introduce a Hierarchical Feature Aggregation (HFA) module and a Confidence‑Aware Complementary Learning (CACL) strategy to enhance the semantic quality of the SSA process in the absence of source and ground truth of target domains. We evaluated our approach on tasks like semantic segmentation and image classification, achieving a 5.2% performance boost on GTA2Cityscapes over the state‑of‑the‑art.

Abstract:
Construction aggregates, including sand and gravel, crushed stone and riprap, are the core building blocks of the construction industry. State‑of‑the‑practice characterization methods mainly rely on visual inspection and manual measurement. State‑of‑the‑art aggregate imaging methods are limited to regular‑sized aggregates under well‑controlled conditions. This dissertation addresses these major challenges by developing a field imaging framework for the morphological characterization of aggregates as a multi‑scenario solution. For individual and non‑overlapping aggregates, a field imaging system was designed and the associated segmentation and volume estimation algorithms were developed. For 2D image analyses of aggregates in stockpiles, an automated 2D instance segmentation and morphological analysis approach was established. For 3D point cloud analyses of aggregate stockpiles, an integrated 3D Reconstruction‑Segmentation‑Completion (RSC‑3D) approach was established: 3D reconstruction procedures from multi‑view images, 3D stockpile instance segmentation, and 3D shape completion to predict the unseen sides. First, a 3D reconstruction procedure was developed to obtain high‑fidelity 3D models of collected aggregate samples, based on which a 3D aggregate particle library was constructed. Next, two datasets were derived from the 3D particle library for 3D learning: a synthetic dataset of aggregate stockpiles with ground‑truth instance labels, and a dataset of partial‑complete shape pairs, developed with varying‑view raycasting schemes. A state‑of‑the‑art 3D instance segmentation network and a 3D shape completion network were trained on the datasets, respectively. The application of the integrated approach was demonstrated on real stockpiles and validated with ground‑truth, showing good performance in capturing and predicting the unseen sides of aggregates.

Abstract:
Intelligent forest tree breeding has advanced plant phenotyping, yet existing research largely focuses on large‑leaf agricultural crops, with limited attention to fine‑grained leaf analysis of sapling trees in open‑field environments. Natural scenes introduce challenges including scale variation, illumination changes, and irregular leaf morphology. To address these issues, we collected UAV RGB imagery of field‑grown saplings and constructed the Poplar‑leaf dataset, containing 1,202 branches and 19,876 pixel‑level annotated leaf instances. To our knowledge, this is the first instance segmentation dataset specifically designed for forestry leaves in open‑field conditions. We propose LeafInst, a novel segmentation framework tailored for irregular and multi‑scale leaf structures. The model integrates an Asymptotic Feature Pyramid Network (AFPN) for multi‑scale perception, a Dynamic Asymmetric Spatial Perception (DASP) module for irregular shape modeling, and a dual‑residual Dynamic Anomalous Regression Head (DARH) with Top‑down Concatenation decoder Feature Fusion (TCFU) to improve detection and segmentation performance. On Poplar‑leaf, LeafInst achieves 68.4 mAP, outperforming YOLOv11 by 7.1 percent and MaskDINO by 6.5 percent. On the public PhenoBench benchmark, it reaches 52.7 box mAP, exceeding MaskDINO by 3.4 percent. Additional experiments demonstrate strong generalization and practical utility for large‑scale leaf phenotyping.

Abstract:
Accurate sea ice mapping is essential for safe maritime navigation in polar regions, where rapidly changing ice conditions require timely and reliable information. While Sentinel‑1 Synthetic Aperture Radar (SAR) provides high‑resolution, all‑weather observations of sea ice, conventional ground‑based processing is limited by downlink bandwidth, latency, and energy costs associated with transmitting large volumes of raw data. On‑board processing, enabled by dedicated inference chips integrated directly within the satellite payload, offers a transformative alternative by generating actionable sea ice products in orbit. In this context, we present TinyIceNet, a compact semantic segmentation network co‑designed for on‑board Stage of Development (SOD) mapping from dual‑polarized Sentinel‑1 SAR imagery under strict hardware and power constraints. Trained on the AI4Arctic dataset, TinyIceNet combines SAR‑aware architectural simplifications with low‑precision quantization to balance accuracy and efficiency. The model is synthesized using High‑Level Synthesis and deployed on a Xilinx Zynq UltraScale+ FPGA platform, demonstrating near‑real‑time inference with significantly reduced energy consumption. Experimental results show that TinyIceNet achieves 75.216% F1 score on SOD segmentation while reducing energy consumption by 2x compared to full‑precision GPU baselines, underscoring the potential of chip‑level hardware‑algorithm co‑design for future spaceborne and edge AI systems.

Abstract:
Video Diffusion Transformers (DiTs) have been synthesizing high‑quality video with high fidelity from given text descriptions involving motion. However, understanding how Video DiTs convert motion words into video remains insufficient. Furthermore, while prior studies on interpretable saliency maps primarily target objects, motion‑related behavior in Video DiTs remains largely unexplored. In this paper, we investigate concrete motion features that specify when and which object moves for a given motion concept. First, to spatially localize, we introduce GramCol, which adaptively produces per‑frame saliency maps for any text concept, including both motion and non‑motion. Second, we propose a motion‑feature selection algorithm to obtain an Interpretable Motion‑Attentive Map (IMAP) that localizes motion spatially and temporally. Our method discovers concept saliency maps without the need for any gradient calculation or parameter update. Experimentally, our method shows outstanding localization capability on the motion localization task and zero‑shot video semantic segmentation, providing interpretable and clearer saliency maps for both motion and non‑motion concepts.

Abstract:
Transparent object instance segmentation presents significant challenges in computer vision, due to the inherent properties of transparent objects, including boundary blur, low contrast, and high dependence on background context. Existing methods often fail as they depend on strong appearance cues and clear boundaries. To address these limitations, we propose SEP‑YOLO, a novel framework that integrates a dual‑domain collaborative mechanism for transparent object instance segmentation. Our method incorporates a Frequency Domain Detail Enhancement Module, which separates and enhances weak high‑frequency boundary components via learnable complex weights. We further design a multi‑scale spatial refinement stream, which consists of a Content‑Aware Alignment Neck and a Multi‑scale Gated Refinement Block, to ensure precise feature alignment and boundary localization in deep semantic features. We also provide high‑quality instance‑level annotations for the Trans10K dataset, filling the critical data gap in transparent object instance segmentation. Extensive experiments on the Trans10K and GVD datasets show that SEP‑YOLO achieves state‑of‑the‑art (SOTA) performance.

Abstract:
Multimodal semantic segmentation integrates complementary information from diverse sensors for remote sensing Earth observation. However, practical systems often encounter missing modalities due to sensor failures or incomplete coverage, a setting termed Incomplete Multimodal Semantic Segmentation (IMSS). IMSS faces three key challenges: (1) multimodal imbalance, where dominant modalities suppress fragile ones; (2) intra‑class variation in scale, shape, and orientation across modalities; and (3) cross‑modal heterogeneity with conflicting cues producing inconsistent semantic responses. Existing methods rely on contrastive learning or joint optimization, which risk either over‑alignment that discards modality‑specific cues or imbalanced training that favors robust modalities, while largely overlooking intra‑class variation and cross‑modal heterogeneity. To address these limitations, we propose the Semantic‑Guided Modality‑Aware (SGMA) framework, which ensures balanced multimodal learning while reducing intra‑class variation and reconciling cross‑modal inconsistencies through semantic guidance. SGMA introduces two complementary plug‑and‑play modules: (1) the Semantic‑Guided Fusion (SGF) module extracts multi‑scale, class‑wise semantic prototypes that capture consistent categorical representations across modalities, estimates per‑modality robustness based on prototype‑feature alignment, and performs adaptive fusion weighted by robustness scores to mitigate intra‑class variation and cross‑modal heterogeneity; (2) the Modality‑Aware Sampling (MAS) module leverages robustness estimations from SGF to dynamically reweight training samples, prioritizing challenging samples from fragile modalities to address modality imbalance. Extensive experiments across multiple datasets and backbones demonstrate that SGMA consistently outperforms state‑of‑the‑art methods, with particularly significant improvements in fragile modalities.

Abstract:
Earth observation machine learning pipelines differ fundamentally from standard computer vision workflows. Imagery is typically delivered as large, georeferenced scenes, labels may be raster masks or vector geometries in distinct coordinate reference systems, and both training and evaluation often require spatially aware sampling and splitting strategies. TorchGeo is a PyTorch‑based domain library that provides datasets, samplers, transforms and pre‑trained models with the goal of making it easy to use geospatial data in machine learning pipelines. In this paper, we introduce a tutorial that demonstrates (1) the core TorchGeo abstractions through code examples, and (2) an end‑to‑end case study on multispectral water segmentation from Sentinel‑2 imagery using the Earth Surface Water dataset. This demonstrates how to train a semantic segmentation model using TorchGeo datasets, apply the model to a Sentinel‑2 scene over Rio de Janeiro, Brazil, and save the resulting predictions as a GeoTIFF for further geospatial analysis. The tutorial code itself is distributed as two Python notebooks: https://torchgeo.readthedocs.io/en/stable/tutorials/torchgeo.html and https://torchgeo.readthedocs.io/en/stable/tutorials/earth_surface_water.html.

Abstract:
Reasoning Video Object Segmentation (ReasonVOS) is a challenging task that requires stable object segmentation across video sequences using implicit and complex textual inputs. Previous methods fine‑tune Multimodal Large Language Models (MLLMs) to produce segmentation outputs, which demand substantial resources. Additionally, some existing methods are coupled in the processing of spatio‑temporal information, which affects the temporal stability of the model to some extent. To address these issues, we propose Training‑Free Spatio‑temporal Decoupled Reasoning Video Segmentation with Adaptive Object Memory (SDAM). We aim to design a training‑free reasoning video segmentation framework that outperforms existing methods requiring fine‑tuning, using only pre‑trained models. Meanwhile, we propose an Adaptive Object Memory module that selects and memorizes key objects based on motion cues in different video sequences. Finally, we propose Spatio‑temporal Decoupling for stable temporal propagation. In the spatial domain, we achieve precise localization and segmentation of target objects, while in the temporal domain, we leverage key object temporal information to drive stable cross‑frame propagation. Our method achieves excellent results on five benchmark datasets, including Ref‑YouTubeVOS, Ref‑DAVIS17, MeViS, ReasonVOS, and ReVOS.

Abstract:
Semantic segmentation plays a pivotal role in various applications such as autonomous driving and medical image analysis. When deploying segmentation models in practice, it is critical to test their behaviors in varied and complex scenes in advance. In this paper, we construct an automatic data generation pipeline, Gen4Seg, to stress‑test semantic segmentation models by generating various challenging samples with different attribute changes. Beyond previous evaluation paradigms focusing solely on global weather and style transfer, we investigate variations in both appearance and geometry attributes at the object and image level. These include object color, material, size, position, as well as image‑level variations such as weather and style. To achieve this, we propose to edit visual attributes of existing real images with precise control of structural information, empowered by diffusion models. In this way, the existing segmentation labels can be reused for the edited images, which greatly reduces the labor costs. Using our pipeline, we construct two new benchmarks, Pascal‑EA and COCO‑EA. We benchmark a wide variety of semantic segmentation models, spanning from closed‑set models to open‑vocabulary large models. We have several key findings: 1) advanced open‑vocabulary models do not exhibit greater robustness compared to closed‑set methods under geometric variations; 2) data augmentation techniques, such as CutOut and CutMix, are limited in enhancing robustness against appearance variations; 3) our pipeline can also be employed as a data augmentation tool and improve both in‑distribution and out‑of‑distribution performance. Our work suggests the potential of generative models as effective tools for automatically analyzing segmentation models, and we hope our findings will assist practitioners and researchers in developing more robust and reliable segmentation models.

Abstract:
Recently, an audio‑visual instance segmentation (AVIS) task has been introduced, aiming to identify, segment and track individual sounding instances in videos. However, prevailing methods primarily adopt the offline paradigm, which cannot associate detected instances across consecutive clips, making them unsuitable for real‑world scenarios that involve continuous video streams. To address this limitation, we introduce SeaVIS, the first online framework designed for audio‑visual instance segmentation. SeaVIS leverages the Causal Cross Attention Fusion (CCAF) module to enable efficient online processing, which integrates visual features from the current frame with the entire audio history under strict causal constraints. A major challenge for conventional VIS methods is that appearance‑based instance association fails to distinguish between an object's sounding and silent states, resulting in the incorrect segmentation of silent objects. To tackle this, we employ an Audio‑Guided Contrastive Learning (AGCL) strategy to generate instance prototypes that encode not only visual appearance but also sounding activity. In this way, instances preserved during per‑frame prediction that do not emit sound can be effectively suppressed during the instance association process, thereby significantly enhancing the audio‑following capability of SeaVIS. Extensive experiments conducted on the AVISeg dataset demonstrate that SeaVIS surpasses existing state‑of‑the‑art models across multiple evaluation metrics while maintaining a competitive inference speed suitable for real‑time processing.

Abstract:
Aerial imagery is critical for large‑scale post‑disaster damage assessment. Automated interpretation remains challenging due to clutter, visual variability, and strong cross‑event domain shift, while supervised approaches still rely on costly, task‑specific annotations with limited coverage across disaster types and regions. Recent open‑vocabulary and foundation vision models offer an appealing alternative, by reducing dependence on fixed label sets and extensive task‑specific annotations. Instead, they leverage large‑scale pretraining and vision‑language representations. These properties are particularly relevant for post‑disaster domains, where visual concepts are ambiguous and data availability is constrained. In this work, we present a comparative evaluation of supervised learning and open‑vocabulary vision models for post‑disaster scene understanding, focusing on semantic segmentation and object detection across multiple datasets, including FloodNet+, RescueNet, DFire, and LADD. We examine performance trends, failure modes, and practical trade‑offs between different learning paradigms, providing insight into their applicability for real‑world disaster response. The most notable remark across all evaluated benchmarks is that supervised training remains the most reliable approach (i.e., when the label space is fixed and annotations are available), especially for small objects and fine boundary delineation in cluttered scenes.

Abstract:
Open‑world promptable 3D semantic segmentation remains brittle as semantics are inferred in the input sensor coordinates. Yet, humans, in contrast, interpret parts via functional roles in a canonical space ‑‑ wings extend laterally, handles protrude to the side, and legs support from below. Psychophysical evidence shows that we mentally rotate objects into canonical frames to reveal these roles. To fill this gap, we propose \methodName, which attains canonical space perception by inducing a latent canonical reference frame learned directly from data. By construction, we create a unified canonical dataset through LLM‑guided intra‑ and cross‑category alignment, exposing canonical spatial regularities across 200 categories. By induction, we realize canonicality inside the model through a dual‑branch architecture with canonical map anchoring and canonical box calibration, collapsing pose variation and symmetry into a stable canonical embedding. This shift from input pose space to canonical embedding yields far more stable and transferable part semantics. Experimental results show that \methodName establishes new state of the art in open‑world promptable 3D segmentation.

Abstract:
Adverse weather conditions significantly degrade the performance of LiDAR point cloud semantic segmentation networks by introducing large distribution shifts. Existing augmentation‑based methods attempt to enhance robustness by simulating weather interference during training. However, they struggle to fully exploit the potential of augmentations due to the trade‑off between minor and aggressive augmentations. To address this, we propose A3Point, an adaptive augmentation‑aware latent learning framework that effectively utilizes a diverse range of augmentations while mitigating the semantic shift, which refers to the change in the semantic meaning caused by augmentations. A3Point consists of two key components: semantic confusion prior (SCP) latent learning, which captures the model's inherent semantic confusion information, and semantic shift region (SSR) localization, which decouples semantic confusion and semantic shift, enabling adaptive optimization strategies for different disturbance levels. Extensive experiments on multiple standard generalized LiDAR segmentation benchmarks under adverse weather demonstrate the effectiveness of our method, setting new state‑of‑the‑art results.

Abstract:
Spiking neural networks (SNNs) offer an energy‑efficient alternative to traditional neural networks due to their event‑driven computing paradigm. However, recent advancements in spiking transformers have focused on improving accuracy with large‑scale architectures, which require significant computational resources and limit deployment on resource‑constrained devices. In this paper, we propose a simple yet effective token pruning method for spiking transformers, termed TP‑Spikformer, that reduces storage and computational overhead while maintaining competitive performance. Specifically, we first introduce a heuristic spatiotemporal information‑retaining criterion that comprehensively evaluates tokens' importance, assigning higher scores to informative tokens for retention and lower scores to uninformative ones for pruning. Based on this criterion, we propose an information‑retaining token pruning framework that employs a block‑level early stopping strategy for uninformative tokens, instead of removing them outright. This also helps preserve more information during token pruning. We demonstrate the effectiveness, efficiency and scalability of TP‑Spikformer through extensive experiments across diverse architectures, including Spikformer, QKFormer and Spike‑driven Transformer V1 and V3, and a range of tasks such as image classification, object detection, semantic segmentation and event‑based object tracking. Particularly, TP‑Spikformer performs well in a training‑free manner. These results reveal its potential as an efficient and practical solution for deploying SNNs in real‑world applications with limited computational resources.

Abstract:
Accurate semantic segmentation of foot ulcers is essential for automated wound monitoring, yet boundary delineation remains challenging due to tissue heterogeneity and poor contrast with surrounding skin. To overcome the limitations of standard intensity‑based networks, we present LSS‑LTCNet, an ante‑hoc explainable framework synergizing deterministic structural priors with continuous‑time neural dynamics. Our architecture departs from traditional black‑box models by employing a Local Self‑Similarity (LSS) mechanism that extracts dense, illumination‑invariant texture descriptors to explicitly disentangle necrotic tissue from background artifacts. To enforce topological precision, we introduce a Liquid Time‑Constant (LTC) refinement module that treats boundary evolution as an ODE‑governed dynamic system, iteratively refining masks over continuous time‑steps. Comprehensive evaluation on the MICCAI FUSeg dataset demonstrates that LSS‑LTCNet achieves state‑of‑the‑art boundary alignment, securing a peak Dice score of 86.96% and an exceptional 95th percentile Hausdorff Distance (HD95) of 8.91 pixels. Requiring merely 25.70M parameters, the model significantly outperforms heavier U‑Net and transformer baselines in efficiency. By providing inherent visual audit trails alongside high‑fidelity predictions, LSS‑LTCNet offers a robust and transparent solution for computer‑aided diagnosis in mobile healthcare (mHealth) settings.
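
For readers unfamiliar with the overlap metric quoted above, the Dice score between a predicted and a reference binary mask can be sketched as follows. This is a generic illustration of the standard definition, not the authors' evaluation code; the function name `dice_score` and the nested-list mask format are assumptions made here for brevity.

```python
def dice_score(pred, target):
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for two binary masks (nested lists of 0/1)."""
    inter = pred_sum = target_sum = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            inter += p * t        # pixel counted only if set in both masks
            pred_sum += p
            target_sum += t
    if pred_sum + target_sum == 0:
        return 1.0                # convention chosen here: two empty masks agree
    return 2.0 * inter / (pred_sum + target_sum)
```

A score of 1.0 means perfect overlap and 0.0 means none. Boundary-sensitive metrics such as HD95 are usually reported alongside Dice because Dice alone is dominated by the mask interior.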

Abstract:
X‑ray computed tomography (CT) is a widely used imaging technique that provides detailed examinations into the internal structure of an object, with synchrotron CT (SR‑CT) enabling improved data quality by using higher energy, monochromatic X‑rays. While SR‑CT allows for improved resolution, time‑resolved experimentation, and reduced imaging artifacts, it also produces significantly larger datasets than conventional CT. Accurate and efficient evaluation of these datasets is a critical component of these workflows, yet it is often done manually, representing a major bottleneck in the analysis phase. While deep learning has emerged as a powerful tool capable of providing a wide range of purely data‑driven solutions, it requires a substantial amount of labeled data for training, and manual annotation of SR‑CT datasets is impractical. In this paper, we introduce a novel framework that enables automatic segmentation of large, high‑resolution SR‑CT datasets by eliminating the need to hand label images for deep learning training. First, we generate pseudo labels by clustering on the voxel values, identifying regions in the volume with similar attenuation coefficients and producing an initial semantic map. Afterwards, we train a segmentation model on the pseudo labels before utilizing the Unbiased Teacher approach to self‑correct them, ensuring accurate final segmentations. We find our approach improves pixel‑wise accuracy and mIoU by 13.31% and 15.94%, respectively, over the baseline pseudo labels when using a magnesium crystal SR‑CT sample. Additionally, we extensively evaluate the different components of our workflow including segmentation model, loss function, pseudo labeling strategy, and input type. Finally, we evaluate our approach on two additional samples, highlighting our framework's ability to produce segmentations that are considerably better than the original pseudo labels.
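
The pseudo-labeling step described above (clustering voxel values so that voxels with similar attenuation share a label) might look roughly like the sketch below. The abstract does not say which clustering algorithm is used, so plain one-dimensional k-means is an assumption here, as are the function name `kmeans_1d` and the evenly spaced initialization.

```python
def kmeans_1d(values, k, iters=20):
    """Cluster scalar voxel values into k groups; returns (labels, centers)."""
    lo, hi = min(values), max(values)
    # Deterministic init (an assumption): centers spread evenly over the value range.
    centers = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        # Assignment step: each voxel takes the label of its nearest center.
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        # Update step: move each center to the mean of its members
        # (an empty cluster keeps its previous center).
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels, centers
```

Applied to a flattened volume, the returned labels form an initial semantic map that a segmentation network can then be trained on; the self-correction stage mentioned in the abstract is not reproduced here.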

Abstract:
This work tests a self‑annotation‑based unsupervised methodology for training a convolutional neural network (CNN) model for semantic segmentation of X‑ray computed tomography (XCT) scans of concretes. Concrete poses a unique challenge for XCT imaging due to similar X‑ray attenuation coefficients of aggregates and mortar, resulting in low contrast between the two phases in the ensuing images. While CNN‑based models are a proven technique for semantic segmentation in such challenging cases, they typically require labeled training data, which is often unavailable for new datasets or costly to obtain. To counter that limitation, a self‑annotation technique is used here which leverages superpixel algorithms to identify perceptually similar local regions in an image and relates them to the global context in the image by utilizing the receptive field of a CNN‑based model. This enables the model to learn a global‑local relationship in the images and enables identification of semantically similar structures. We therefore present the performance of the unsupervised training methodology on our XCT datasets and discuss potential avenues for further improvements.

Abstract:
Occlusion remains a critical challenge in robotic fruit harvesting, as undetected or inaccurately localised fruits often result in substantial crop losses. To mitigate this issue, we propose a harvesting framework using a new amodal segmentation model, GDA‑YOLO11, which incorporates architectural improvements and an updated asymmetric mask loss. The proposed model is trained on a modified version of a public citrus dataset and evaluated on both the base dataset and occlusion‑sensitive subsets with varying occlusion levels. Within the framework, full fruit masks, including invisible regions, are inferred by GDA‑YOLO11, and picking points are subsequently estimated using the Euclidean distance transform. These points are then projected into 3D coordinates for robotic harvesting execution. Experiments were conducted using real citrus fruits in a controlled environment simulating occlusion scenarios. Notably, to the best of our knowledge, this study provides the first practical demonstration of amodal instance segmentation in robotic fruit harvesting. GDA‑YOLO11 achieves a precision of 0.844, recall of 0.846, mAP@50 of 0.914, and mAP@50:95 of 0.636, outperforming YOLO11n by 5.1%, 1.3%, and 1.0% in precision, mAP@50, and mAP@50:95, respectively. The framework attains harvesting success rates of 92.59%, 85.18%, 48.14%, and 22.22% at zero to high occlusion levels, improving success by 3.5% under medium and high occlusion. These findings demonstrate that GDA‑YOLO11 enhances occlusion‑robust segmentation and streamlines perception‑to‑action integration, paving the way for more reliable autonomous systems in agriculture.
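
One plausible reading of the picking-point step is to choose the mask pixel that lies farthest from the background, i.e. the maximum of the Euclidean distance transform. The abstract does not spell out the exact rule, so the sketch below is an assumption; the function name `picking_point` is hypothetical and the brute-force distance computation is only suitable for small masks.

```python
import math

def picking_point(mask):
    """Return (row, col) of the foreground pixel farthest from any background pixel.

    Brute-force Euclidean distance transform over a nested-list binary mask.
    Pixels outside the image are not treated as background in this sketch.
    """
    fg = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    bg = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if not v]
    if not fg or not bg:
        return None  # no fruit pixels, or no background to measure against
    best, best_dist = None, -1.0
    for r, c in fg:
        # Distance transform value at (r, c): distance to the nearest background pixel.
        d = min(math.hypot(r - br, c - bc) for br, bc in bg)
        if d > best_dist:
            best, best_dist = (r, c), d
    return best
```

For full-resolution masks, a library routine such as `scipy.ndimage.distance_transform_edt` computes the same distance map far more efficiently. Because the amodal mask includes occluded regions, the selected point can fall behind an occluder, which is presumably why the 3D projection step matters.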

Abstract:
In this paper, we propose ReSeg‑CLIP, a new training‑free Open‑Vocabulary Semantic Segmentation method for remote sensing data. To compensate for the problems of vision language models, such as CLIP in semantic segmentation caused by inappropriate interactions within the self‑attention layers, we introduce a hierarchical scheme utilizing masks generated by SAM to constrain the interactions at multiple scales. We also present a model composition approach that averages the parameters of multiple RS‑specific CLIP variants, taking advantage of a new weighting scheme that evaluates representational quality using varying text prompts. Our method achieves state‑of‑the‑art results across three RS benchmarks without additional training.

Abstract:
Point cloud video understanding is critical for robotics as it accurately encodes motion and scene interaction. We recognize that 4D datasets are far scarcer than 3D ones, which hampers the scalability of self‑supervised 4D models. A promising alternative is to transfer 3D pre‑trained models to 4D perception tasks. However, rigorous empirical analysis reveals two critical limitations that impede transfer capability: overfitting and the modality gap. To overcome these challenges, we develop a novel "Align then Adapt" (PointATA) paradigm that decomposes parameter‑efficient transfer learning into two sequential stages. Optimal‑transport theory is employed to quantify the distributional discrepancy between 3D and 4D datasets, enabling our proposed point align embedder to be trained in Stage 1 to alleviate the underlying modality gap. To mitigate overfitting, an efficient point‑video adapter and a spatial‑context encoder are integrated into the frozen 3D backbone to enhance temporal modeling capacity in Stage 2. Notably, with the above engineering‑oriented designs, PointATA enables a pre‑trained 3D model without temporal knowledge to reason about dynamic video content at a smaller parameter cost compared to previous work. Extensive experiments show that PointATA can match or even outperform strong full fine‑tuning models, whilst enjoying the advantage of parameter efficiency, e.g. 97.21% accuracy on 3D action recognition, +8.7% on 4D action segmentation, and 84.06% on 4D semantic segmentation.

Abstract:
Panoramic semantic segmentation models are typically trained under a strict gravity‑aligned assumption. However, real‑world captures often deviate from this canonical orientation due to unconstrained camera motions, such as the rotational jitter of handheld devices or the dynamic attitude shifts of aerial platforms. This discrepancy causes standard spherical Transformers to overfit global latitude cues, leading to performance collapse under 3D reorientations. To address this, we introduce SO3UFormer, a rotation‑robust architecture designed to learn intrinsic spherical features that are less sensitive to the underlying coordinate frame. Our approach rests on three geometric pillars: (1) an intrinsic feature formulation that decouples the representation from the gravity vector by removing absolute latitude encoding; (2) quadrature‑consistent spherical attention that accounts for non‑uniform sampling densities; and (3) a gauge‑aware relative positional mechanism that encodes local angular geometry using tangent‑plane projected angles and discrete gauge pooling, avoiding reliance on global axes. We further use index‑based spherical resampling together with a logit‑level SO(3)‑consistency regularizer during training. To rigorously benchmark robustness, we introduce Pose35, a dataset variant of Stanford2D3D perturbed by random rotations within ±35°. Under the extreme test of arbitrary full SO(3) rotations, existing SOTAs fail catastrophically: the baseline SphereUFormer drops from 67.53 mIoU to 25.26 mIoU. In contrast, SO3UFormer demonstrates remarkable stability, achieving 72.03 mIoU on Pose35 and retaining 70.67 mIoU under full SO(3) rotations.
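
Since the robustness comparison above is stated entirely in mIoU, a minimal sketch of how that metric is commonly computed may help. This is the generic per-class intersection-over-union average, not the benchmark's official evaluation script; the function name `mean_iou`, the flat label lists, and the choice to skip classes absent from both prediction and ground truth are assumptions made here.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that occur in the prediction or the ground truth.

    `pred` and `target` are flat sequences of integer class labels, one per pixel.
    """
    ious = []
    for cls in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union:  # skip classes that appear in neither map
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0
```

The function returns a fraction in [0, 1]; the figures quoted in the abstract are on a 0 to 100 scale.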

Abstract:
Foundation models for medical imaging are typically pretrained on increasingly large datasets, following a "scale‑at‑all‑costs" paradigm. However, this strategy faces two critical challenges: large‑scale medical datasets often contain substantial redundancy and severe class imbalance that bias representation learning toward over‑represented patterns, and indiscriminate training regardless of heterogeneity in data quality incurs considerable computational inefficiency. Here we demonstrate that active, principled data curation during pretraining can serve as a viable, cost‑effective alternative to brute‑force dataset enlargement. We introduce CheXficient, a chest X‑ray (CXR) foundation model that selectively prioritizes informative training samples. CheXficient is pretrained on only 22.7% of 1,235,004 paired CXR images and reports while consuming under 27.3% of the total compute budget, yet achieving comparable or superior performance to its full‑data counterpart and other large‑scale pretrained models. We assess CheXficient across 20 individual benchmarks spanning 5 task types, including non‑adapted off‑the‑shelf evaluations (zero‑shot findings classification and crossmodal retrieval) and adapted downstream tasks (disease prediction, semantic segmentation, and radiology report generation). Further analyses show that CheXficient systematically prioritizes under‑represented training samples, improving generalizability on long‑tailed or rare conditions. Overall, our work offers practical insights into the data and computation demands for efficient pretraining and downstream adaptation of medical vision‑language foundation models.

Abstract:
Editing images via instruction provides a natural way to generate interactive content, but it is a big challenge due to the higher requirement of scene understanding and generation. Prior work utilizes a chain of large language models, object segmentation models, and editing models for this task. However, the understanding models provide only a single modality ability, restricting the editing quality. We aim to bridge understanding and generation via a new multi‑modality model that provides the intelligent abilities to instruction‑based image editing models for more complex cases. To achieve this goal, we individually separate the instruction editing task with the multi‑modality chain of thought prompts, i.e., Chain‑of‑Thought (CoT) planning, editing region reasoning, and editing. For Chain‑of‑Thought planning, the large language model could reason the appropriate sub‑prompts considering the instruction provided and the ability of the editing network. For editing region reasoning, we train an instruction‑based editing region generation network with a multi‑modal large language model. Finally, a hint‑guided instruction‑based editing network is proposed for editing image generations based on the sizeable text‑to‑image diffusion model to accept the hints for generation. Extensive experiments demonstrate that our method has competitive editing abilities on complex real‑world images.

Abstract:
Ego‑centric driving videos available online provide an abundant source of visual data for autonomous driving, yet their lack of annotations makes it difficult to learn representations that capture both semantic structure and 3D geometry. Recent advances in large feedforward spatial models demonstrate that point maps and ego‑motion can be inferred in a single forward pass, suggesting a promising direction for scalable driving perception. We therefore propose a label‑free, teacher‑guided framework for learning autonomous driving representations directly from unposed videos. Unlike prior self‑supervised approaches that focus primarily on frame‑to‑frame consistency, we posit that safe and reactive driving depends critically on temporal context. To this end, we leverage a feedforward architecture equipped with a lightweight autoregressive module, trained using multi‑modal supervisory signals that guide the model to jointly predict current and future point maps, camera poses, semantic segmentation, and motion masks. Multi‑modal teachers provide sequence‑level pseudo‑supervision, enabling LFG to learn a unified pseudo‑4D representation from raw YouTube videos without poses, labels, or LiDAR. The resulting encoder not only transfers effectively to downstream autonomous driving planning on the NAVSIM benchmark, surpassing multi‑camera and LiDAR baselines with only a single monocular camera, but also yields strong performance when evaluated on a range of semantic, geometric, and qualitative motion prediction tasks. These geometry and motion‑aware features position LFG as a compelling video‑centric foundation model for autonomous driving.

Abstract:
Accurate annotation of endoscopic videos is essential yet time‑consuming, particularly for challenging datasets such as dysplasia in Barrett's esophagus, where the affected regions are irregular and lack clear boundaries. Semi‑automatic tools like Segment Anything Model 2 (SAM2) can ease this process by propagating annotations across frames, but small errors often accumulate and reduce accuracy, requiring expert review and correction. To address this, we systematically study how annotation errors propagate across different prompt types, namely masks, boxes, and points, and propose Learning‑to‑Re‑Prompt (L2RP), a cost‑aware framework that learns when and where to seek expert input. By tuning a human‑cost parameter, our method balances annotation effort and segmentation accuracy. Experiments on a private Barrett's dysplasia dataset and the public SUN‑SEG benchmark demonstrate improved temporal consistency and superior performance over baseline strategies.

Abstract:
Single‑point annotation is increasingly prominent in visual tasks for labeling cost reduction. However, it challenges tasks requiring high precision, such as the point‑prompted instance segmentation (PPIS) task, which aims to estimate precise masks using single‑point prompts to train a segmentation network. Due to the constraints of point annotations, granularity ambiguity and boundary uncertainty arise: the difficulty of distinguishing between different levels of detail (e.g., whole object vs. parts) and the challenge of precisely delineating object boundaries, respectively. Previous works have usually inherited the paradigm of mask generation along with proposal selection to achieve PPIS. However, proposal selection relies solely on category information, failing to resolve the ambiguity of different granularity. Furthermore, mask generators offer only finite discrete solutions that often deviate from actual masks, particularly at boundaries. To address these issues, we propose the Semantic‑Aware Point‑Prompted Instance Segmentation Network (SAPNet). It integrates Point Distance Guidance and Box Mining Strategy to tackle group and local issues caused by the point's granularity ambiguity. Additionally, we incorporate completeness scores within proposals to add spatial granularity awareness, enhancing multiple instance learning (MIL) in proposal selection, termed S‑MIL. The Multi‑level Affinity Refinement conveys pixel and semantic clues, narrowing boundary uncertainty during mask refinement. These modules culminate in SAPNet++, mitigating the point prompt's granularity ambiguity and boundary uncertainty and significantly improving segmentation performance. Extensive experiments on four challenging datasets validate the effectiveness of our methods, highlighting the potential to advance PPIS.

Abstract:
Accurate forest stand delineation is essential for forest inventory and management but remains a largely manual and subjective process. A recent study has shown that deep learning can produce stand delineations comparable to expert interpreters when combining aerial imagery and airborne laser scanning (ALS) data. However, temporal misalignment between data sources limits operational scalability. Canopy height models (CHMs) derived from digital photogrammetry (DAP) offer better temporal alignment but may smooth the canopy surface and canopy gaps, raising the question of whether they can reliably replace ALS‑derived CHMs. Similarly, the inclusion of a digital terrain model (DTM) has been suggested to improve delineation performance, but has remained untested in published literature. Using expert‑delineated forest stands as reference data, we assessed a U‑Net‑based semantic segmentation framework with municipality‑level cross‑validation across six municipalities in southeastern Norway. We compared multispectral aerial imagery combined with (i) an ALS‑derived CHM, (ii) a DAP‑derived CHM, and (iii) a DAP‑derived CHM in combination with a DTM. Results showed comparable performance across all data combinations, reaching overall accuracy values between 0.90 and 0.91. Agreement between model predictions was substantially larger than agreement with the reference data, highlighting both model consistency and the inherent subjectivity of stand delineation. The similar performance of DAP‑CHMs, despite the reduced structural detail, and the lack of improvement from the DTM indicate that the framework is resilient to variations in input data. These findings indicate that large datasets for deep learning‑based stand delineations can be assembled using projects including temporally aligned ALS data and DAP point clouds.
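
The municipality-level cross-validation mentioned above can be read as a leave-one-group-out scheme, in which every tile from one municipality is held out together so that spatially adjacent tiles never appear on both sides of a split. The abstract does not describe the exact protocol, so this reading is an assumption, and the function name `leave_one_group_out` is hypothetical.

```python
def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices) for each distinct group.

    `groups[i]` is the group identifier (e.g. a municipality name) of sample i.
    """
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        test = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train, test
```

With six municipalities this produces six folds, each evaluating on an area the model has never seen during training.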

Abstract:
While existing unsupervised domain adaptation (UDA) methods greatly enhance target domain performance in semantic segmentation, they often neglect network calibration quality, resulting in misalignment between prediction confidence and actual accuracy ‑‑ a significant risk in safety‑critical applications. Our key insight emerges from observing that performance degrades substantially when soft pseudo‑labels replace hard pseudo‑labels in cross‑domain scenarios due to poor calibration, despite the theoretical equivalence of perfectly calibrated soft pseudo‑labels to hard pseudo‑labels. Based on this finding, we propose DA‑Cal, a dedicated cross‑domain calibration framework that transforms target domain calibration into soft pseudo‑label optimization. DA‑Cal introduces a Meta Temperature Network to generate pixel‑level calibration parameters and employs bi‑level optimization to establish the relationship between soft pseudo‑labels and UDA supervision, while utilizing complementary domain‑mixing strategies to prevent overfitting and reduce domain discrepancies. Experiments demonstrate that DA‑Cal seamlessly integrates with existing self‑training frameworks across multiple UDA segmentation benchmarks, significantly improving target domain calibration while delivering performance gains without inference overhead. The code will be released.
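
The link between temperature and soft pseudo‑labels described above can be illustrated with a minimal sketch. It assumes the per‑pixel temperatures are already available (in DA‑Cal they would come from the Meta Temperature Network, which is not reproduced here); the function names `soft_pseudo_label` and `calibrate_pixels` are hypothetical and not taken from the paper.

```python
import math


def soft_pseudo_label(logits, temperature):
    """Temperature-scaled softmax over one pixel's class logits."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def calibrate_pixels(pixel_logits, pixel_temperatures):
    """Apply one (assumed, externally predicted) temperature per pixel."""
    return [
        soft_pseudo_label(logits, t)
        for logits, t in zip(pixel_logits, pixel_temperatures)
    ]
```

A temperature above 1 flattens the distribution and lowers the stated confidence, while the arg‑max class (the hard pseudo‑label) stays the same.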

Abstract:
This study details an artificial intelligence (AI)‑based methodology for the semantic segmentation of space camera faults. Specifically, we address the segmentation of straylight effects induced by solar presence around the camera's Field of View (FoV). Anomalous images are sourced from our published dataset. Our approach emphasizes generalization across diverse flare textures, leveraging pre‑training on a public dataset (Flare7k++) including flares in various non‑space contexts to mitigate the scarcity of realistic space‑specific data. A DeepLabV3 model with MobileNetV3 backbone performs the segmentation task. The model design targets deployment in spacecraft resource‑constrained hardware. Finally, based on a proposed interface between our model and the onboard navigation pipeline, we develop custom metrics to assess the model's performance in the system‑level context.

Abstract:
Verbal‑prompted segmentation is inherently limited by the expressiveness of natural language and struggles with uncommon, instance‑specific, or difficult‑to‑describe objects: scenarios frequently encountered in manufacturing and 3D printing environments. While image exemplars provide an alternative, they primarily encode appearance cues such as color and texture, which are often unrelated to a part's geometric identity. In industrial settings, a single component may be produced in different materials, finishes, or colors, making appearance‑based prompting unreliable. In contrast, such objects are typically defined by precise CAD models that capture their canonical geometry. We propose a CAD‑prompted segmentation framework built on SAM3 that uses canonical multi‑view renderings of a CAD model as prompt input. The rendered views provide geometry‑based conditioning independent of surface appearance. The model is trained using synthetic data generated from mesh renderings in simulation under diverse viewpoints and scene contexts. Our approach enables single‑stage, CAD‑prompted mask prediction, extending promptable segmentation to objects that cannot be robustly described by language or appearance alone.

Abstract:
Accurate per‑branch 3D reconstruction is a prerequisite for autonomous UAV‑based tree pruning; however, dense disparity maps from modern stereo matchers often remain too noisy for individual branch analysis in complex forest canopies. This paper introduces a progressive pipeline integrating DEFOM‑Stereo foundation‑model disparity estimation, SAM3 instance segmentation, and multi‑stage depth optimization to deliver robust per‑branch point clouds. Starting from a naive baseline, we systematically identify and resolve three error families through successive refinements. Mask boundary contamination is first addressed through morphological erosion and subsequently refined via a skeleton‑preserving variant to safeguard thin‑branch topology. Segmentation inaccuracy is then mitigated using LAB‑space Mahalanobis color validation coupled with cross‑branch overlap arbitration. Finally, depth noise ‑ the most persistent error source ‑ is initially reduced by outlier removal and median filtering, before being superseded by a robust five‑stage scheme comprising MAD global detection, spatial density consensus, local MAD filtering, RGB‑guided filtering, and adaptive bilateral filtering. Evaluated on 1920x1080 stereo imagery of Radiata pine (Pinus radiata) acquired with a ZED Mini camera (63 mm baseline) from a UAV in Canterbury, New Zealand, the proposed pipeline reduces the average per‑branch depth standard deviation by 82% while retaining edge fidelity. The result is geometrically coherent 3D point clouds suitable for autonomous pruning tool positioning. All code and processed data are publicly released to facilitate further UAV forestry research.
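
As an illustration of the first stage of the five‑stage scheme (MAD global detection), the following is a minimal sketch of median‑absolute‑deviation outlier rejection on a list of per‑branch depth samples. The function name `mad_inlier_mask`, the 0.6745 scaling, and the 3.5 cut‑off are assumptions following the common modified z‑score convention; the abstract does not state the thresholds actually used.

```python
import statistics


def mad_inlier_mask(depths, threshold=3.5):
    """Return True for samples whose modified z-score is within threshold."""
    med = statistics.median(depths)
    mad = statistics.median([abs(d - med) for d in depths])
    if mad == 0:
        # More than half the samples equal the median; keep only those.
        return [d == med for d in depths]
    # 0.6745 makes the MAD comparable to a standard deviation
    # for normally distributed data (assumed convention).
    return [0.6745 * abs(d - med) / mad <= threshold for d in depths]
```

For example, with depths `[2.0, 2.1, 1.9, 2.05, 1.95, 9.0]` (metres), only the 9.0 sample is rejected.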

Abstract:
Current zero‑shot Camouflaged Object Segmentation methods typically employ a two‑stage pipeline (discover‑then‑segment): using MLLMs to obtain visual prompts, followed by SAM segmentation. However, relying solely on MLLMs for camouflaged object discovery often leads to inaccurate localization, false positives, and missed detections. To address these issues, we propose the Discover‑Segment‑Select (DSS) mechanism, a progressive framework designed to refine segmentation step by step. The proposed method contains a Feature‑coherent Object Discovery (FOD) module that leverages visual features to generate diverse object proposals, a segmentation module that refines these proposals through SAM segmentation, and a Semantic‑driven Mask Selection (SMS) module that employs MLLMs to evaluate and select the optimal segmentation mask from multiple candidates. Without requiring any training or supervision, DSS achieves state‑of‑the‑art performance on multiple COS benchmarks, especially in multiple‑instance scenes.

Abstract:
While deep learning has demonstrated considerable promise in computer‑aided diagnosis for pulmonary embolism (PE), practical deployment in Computed Tomography Pulmonary Angiography (CTPA) is often hindered by "domain shift" and the prohibitive cost of expert annotations. To address these challenges, an unsupervised domain adaptation (UDA) framework is proposed, utilizing a Transformer backbone and a Mean‑Teacher architecture for cross‑center semantic segmentation. The primary focus is placed on enhancing pseudo‑label reliability by learning deep structural information within the feature space. Specifically, three modules are integrated and designed for this task: (1) a Prototype Alignment (PA) mechanism to reduce category‑level distribution discrepancies; (2) Global and Local Contrastive Learning (GLCL) to capture both pixel‑level topological relationships and global semantic representations; and (3) an Attention‑based Auxiliary Local Prediction (AALP) module designed to reinforce sensitivity to small PE lesions by automatically extracting high‑information slices from Transformer attention maps. Experimental validation conducted on cross‑center datasets (FUMPE and CAD‑PE) demonstrates significant performance gains. In the FUMPE ‑> CAD‑PE task, the IoU increased from 0.1152 to 0.4153, while the CAD‑PE ‑> FUMPE task saw an improvement from 0.1705 to 0.4302. Furthermore, the proposed method achieved a 69.9% Dice score in the CT ‑> MRI cross‑modality task on the MMWHS dataset without utilizing any target‑domain labels for model selection, confirming its robustness and generalizability for diverse clinical environments.

Abstract:
While diffusion models have achieved state‑of‑the‑art performance in Image Super‑Resolution (SR), their prohibitive computational and memory demands restrict their training and inference to fixed‑size inputs. The standard workaround to super‑resolve larger images relies on partitioning the image, super‑resolving patches independently, and stitching them together ‑‑ a process that inevitably introduces severe boundary artifacts and spatial inconsistencies in large‑scale scenes. To achieve spatially continuous, arbitrary‑size image super‑resolution, we propose InfScene‑SR, a diffusion‑based SR approach. Building upon SR3, our approach leverages Variance‑Corrected Fusion (VCF) to perform joint‑denoising across overlapping patches. VCF guarantees continuous transitions while preserving the stochastic variance crucial for high‑fidelity texture reconstruction. To overcome the prohibitive synchronization overhead of scaling joint‑denoising to gigapixel imagery, we introduce Spatially‑Decoupled Variance Correction (SDVC). SDVC reformulates the global fusion process into independent, atomic patch operations, drastically reducing memory complexity to $\mathcal{O}(1)$ and naturally enabling fully distributed, parallelized inference. Extensive experiments on large‑scale remote sensing datasets demonstrate that InfScene‑SR strictly eliminates boundary seams, achieves superior perceptual quality, and significantly boosts performance in downstream semantic segmentation tasks.

Abstract:
Semantic segmentation of 3D LiDAR point clouds is important in urban remote sensing for understanding real‑world street environments. This task, by projecting LiDAR point clouds and 3D semantic labels as sparse maps, can be reformulated as a 2D problem. However, the intrinsic sparsity of the projected LiDAR and label maps can result in sparse and inaccurate intermediate 2D semantic predictions, which in return limits the final 3D accuracy. To address this issue, we enhance this task by shaping dense and accurate 2D predictions. Specifically, we develop a multi‑modal segmentation model, MM2D3D. By leveraging camera images as auxiliary data, we introduce cross‑modal guided filtering to overcome label map sparsity by constraining intermediate 2D semantic predictions with dense semantic relations derived from the camera images; and we introduce dynamic cross pseudo supervision to overcome LiDAR map sparsity by encouraging the 2D predictions to emulate the dense distribution of the semantic predictions from the camera images. Experiments show that our techniques enable our model to achieve intermediate 2D semantic predictions with dense distribution and higher accuracy, which effectively enhances the final 3D accuracy. Comparisons with previous methods demonstrate our superior performance in both 2D and 3D spaces.

Abstract:
Domain Generalization in Semantic Segmentation (DG‑SS) aims to enable segmentation models to perform robustly in unseen environments. However, conventional DG‑SS methods are restricted to a fixed set of known categories, limiting their applicability in open‑world scenarios. Recent progress in Vision‑Language Models (VLMs) has advanced Open‑Vocabulary Semantic Segmentation (OV‑SS) by enabling models to recognize a broader range of concepts. Yet, these models remain sensitive to domain shifts and struggle to maintain robustness when deployed in unseen environments, a challenge that is particularly severe in urban‑driving scenarios. To bridge this gap, we introduce Open‑Vocabulary Domain Generalization in Semantic Segmentation (OVDG‑SS), a new setting that jointly addresses unseen domains and unseen categories. We introduce the first benchmark for OVDG‑SS in autonomous driving, addressing a previously unexplored problem and covering both synthetic‑to‑real and real‑to‑real generalization across diverse unseen domains and unseen categories. In OVDG‑SS, we observe that domain shifts often distort text‑image correlations in pre‑trained VLMs, which hinders the performance of OV‑SS models. To tackle this challenge, we propose S2‑Corr, a state‑space‑driven text‑image correlation refinement mechanism that mitigates domain‑induced distortions and produces more consistent text‑image correlations under distribution changes. Extensive experiments on our constructed benchmark demonstrate that the proposed method achieves superior cross‑domain performance and efficiency compared to existing OV‑SS approaches.

Abstract:
In recent years, foundation models such as CLIP, DINO, and CONCH have demonstrated remarkable domain generalization and unsupervised feature extraction capabilities across diverse imaging tasks. However, systematic and independent evaluations of these models for pixel‑level semantic segmentation in histopathology remain scarce. In this study, we propose a robust benchmarking approach to assess 10 foundation models on four histopathological datasets covering both morphological tissue‑region and cellular/nuclear segmentation tasks. Our method leverages attention maps of foundation models as pixel‑wise features, which are then classified using a machine learning algorithm, XGBoost, enabling fast, interpretable, and model‑agnostic evaluation without finetuning. We show that the vision‑language foundation model CONCH performed the best across datasets when compared to vision‑only foundation models, with PathDino a close second. Further analysis shows that models trained on distinct histopathology cohorts capture complementary morphological representations, and concatenating their features yields superior segmentation performance. Concatenating features from CONCH, PathDino, and CellViT outperformed individual models across all the datasets by 7.95% (averaged across the datasets), suggesting that ensembles of foundation models can better generalize to diverse histopathological segmentation tasks.

Abstract:
Dense Bird's Eye View (BEV) semantic maps are central to autonomous driving, yet current multi‑camera methods depend on costly, inconsistently annotated BEV ground truth. We address this limitation with a two‑phase training strategy for fine‑grained road marking segmentation that removes full supervision during pretraining and halves the amount of training data during fine‑tuning while still outperforming the comparable supervised baseline model. During the self‑supervised pretraining, BEVFormer predictions are differentiably reprojected into the image plane and trained against multi‑view semantic pseudo‑labels generated by the widely used semantic segmentation model Mask2Former. A temporal loss encourages consistency across frames. The subsequent supervised fine‑tuning phase requires only 50% of the dataset and significantly less training time. With our method, the fine‑tuning benefits from rich priors learned during pretraining boosting the performance and BEV segmentation quality (up to +2.5pp mIoU over the fully supervised baseline) on nuScenes. It simultaneously halves the usage of annotation data and reduces total training time by up to two thirds. The results demonstrate that differentiable reprojection plus camera perspective pseudo labels yields transferable BEV features and a scalable path toward reduced‑label autonomous perception.

Abstract:
Class imbalance induces systematic bias in deep neural networks by imposing a skewed effective class prior. This work introduces the Neural Prior Estimator (NPE), a framework that learns feature‑conditioned log‑prior estimates from latent representations. NPE employs one or more Prior Estimation Modules trained jointly with the backbone via a one‑way logistic loss. Under the Neural Collapse regime, NPE is analytically shown to recover the class log‑prior up to an additive constant, providing a theoretically grounded adaptive signal without requiring explicit class counts or distribution‑specific hyperparameters. The learned estimate is incorporated into logit adjustment, forming NPE‑LA, a principled mechanism for bias‑aware prediction. Experiments on long‑tailed CIFAR and imbalanced semantic segmentation benchmarks (STARE, ADE20K) demonstrate consistent improvements, particularly for underrepresented classes. NPE thus offers a lightweight and theoretically justified approach to learned prior estimation and imbalance‑aware prediction.
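
The logit‑adjustment step can be sketched as follows. This is a minimal illustration of generic logit adjustment with a given log‑prior estimate, not the paper's implementation: the function names and the scaling factor `tau` are assumptions, and the log‑prior is passed in directly rather than produced by a Prior Estimation Module.

```python
def adjust_logits(logits, log_prior, tau=1.0):
    """Subtract a (scaled) log-prior estimate from each class logit."""
    return [z - tau * lp for z, lp in zip(logits, log_prior)]


def predict(logits, log_prior, tau=1.0):
    """Index of the arg-max class after logit adjustment."""
    adjusted = adjust_logits(logits, log_prior, tau)
    return max(range(len(adjusted)), key=adjusted.__getitem__)
```

Because the same constant added to every entry of the log‑prior shifts all adjusted logits equally, the predicted class is unchanged, which is why recovering the log‑prior only up to an additive constant is sufficient for this use.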

Abstract:
Existing online video segmentation models typically combine a per‑frame segmenter with complex specialized tracking modules. While effective, these modules introduce significant architectural complexity and computational overhead. Recent studies suggest that plain Vision Transformer (ViT) encoders, when scaled with sufficient capacity and large‑scale pre‑training, can conduct accurate image segmentation without requiring specialized modules. Motivated by this observation, we propose the Video Encoder‑only Mask Transformer (VidEoMT), a simple encoder‑only video segmentation model that eliminates the need for dedicated tracking modules. To enable temporal modeling in an encoder‑only ViT, VidEoMT introduces a lightweight query propagation mechanism that carries information across frames by reusing queries from the previous frame. To balance this with adaptability to new content, it employs a query fusion strategy that combines the propagated queries with a set of temporally‑agnostic learned queries. As a result, VidEoMT attains the benefits of a tracker without added complexity, achieving competitive accuracy while being 5x‑10x faster, running at up to 160 FPS with a ViT‑L backbone. Code: https://www.tue‑mps.org/videomt/

Abstract:
Existing aerial robot navigation systems typically plan paths around static and dynamic obstacles, but fail to adapt when a static obstacle suddenly moves. Integrating environmental semantic awareness enables estimation of potential risks posed by suddenly moving obstacles. In this paper, we propose RA‑Nav, a risk‑aware navigation framework based on semantic segmentation. A lightweight multi‑scale semantic segmentation network identifies obstacle categories in real time. These obstacles are further classified into three types: stationary, temporarily static, and dynamic. For each type, corresponding risk estimation functions are designed to enable real‑time risk prediction, based on which a complete local risk map is constructed. Based on this map, the risk‑informed path search algorithm is designed to guarantee planning that balances path efficiency and safety. Trajectory optimization is then applied to generate trajectories that are safe, smooth, and dynamically feasible. Comparative simulations demonstrate that RA‑Nav achieves higher success rates than baselines in sudden obstacle state transition scenarios. Its effectiveness is further validated in simulations using real‑world data.

Abstract:
Day‑to‑night unpaired image translation is important to downstream tasks but remains challenging due to large appearance shifts and the lack of direct pixel‑level supervision. Existing methods often introduce semantic hallucinations, where objects from target classes such as traffic signs and vehicles, as well as man‑made light effects, are incorrectly synthesized. These hallucinations significantly degrade downstream performance. We propose a novel framework that detects and suppresses hallucinations of target‑class features during unpaired translation. To detect hallucination, we design a dual‑head discriminator that additionally performs semantic segmentation to identify hallucinated content in background regions. To suppress these hallucinations, we introduce class‑specific prototypes, constructed by aggregating features of annotated target‑domain objects, which act as semantic anchors for each class. Built upon a Schrödinger Bridge‑based translation model, our framework performs iterative refinement, where detected hallucination features are explicitly pushed away from class prototypes in feature space, thus preserving object semantics across the translation trajectory. Experiments show that our method outperforms existing approaches both qualitatively and quantitatively. On the BDD100K dataset, it improves mAP by 15.5% for day‑to‑night domain adaptation, with a notable 31.7% gain for classes such as traffic lights that are prone to hallucinations.

Abstract:
High‑quality video datasets are foundational for training robust models in tasks like action recognition, phase detection, and event segmentation. However, many real‑world video datasets suffer from annotation errors such as mislabeling, where segments are assigned incorrect class labels, and disordering, where the temporal sequence does not follow the correct progression. These errors are particularly harmful in phase‑annotated tasks, where temporal consistency is critical. We propose a novel, model‑agnostic method for detecting annotation errors by analyzing the Cumulative Sample Loss (CSL)‑‑defined as the average loss a frame incurs when passing through model checkpoints saved across training epochs. This per‑frame loss trajectory acts as a dynamic fingerprint of frame‑level learnability. Mislabeled or disordered frames tend to show consistently high or irregular loss patterns, as they remain difficult for the model to learn throughout training, while correctly labeled frames typically converge to low loss early. To compute CSL, we train a video segmentation model and store its weights at each epoch. These checkpoints are then used to evaluate the loss of each frame in a test video. Frames with persistently high CSL are flagged as likely candidates for annotation errors, including mislabeling or temporal misalignment. Our method does not require ground truth on annotation errors and is generalizable across datasets. Experiments on EgoPER and Cholec80 demonstrate strong detection performance, effectively identifying subtle inconsistencies such as mislabeling and frame disordering. The proposed approach provides a powerful tool for dataset auditing and improving training reliability in video‑based machine learning.
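
The CSL computation described above can be sketched in a few lines. The sketch assumes the per‑frame losses under each saved checkpoint have already been evaluated and stored as a nested list; the function names and the fixed flagging threshold are hypothetical, since the abstract does not specify how the cut‑off is chosen.

```python
def cumulative_sample_loss(loss_per_checkpoint):
    """Average each frame's loss over all checkpoints.

    loss_per_checkpoint[e][i] is the loss of frame i under checkpoint e.
    """
    n_checkpoints = len(loss_per_checkpoint)
    n_frames = len(loss_per_checkpoint[0])
    return [
        sum(epoch[i] for epoch in loss_per_checkpoint) / n_checkpoints
        for i in range(n_frames)
    ]


def flag_frames(csl, threshold):
    """Indices of frames whose CSL exceeds an assumed threshold."""
    return [i for i, value in enumerate(csl) if value > threshold]
```

A frame whose loss stays high across epochs keeps a high average and is flagged, whereas a frame that converges early ends up with a low CSL even if its initial loss was large.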

Abstract:
Gliomas, among the most common primary brain tumors, vary widely in aggressiveness, prognosis, and histology, making treatment challenging due to complex and time‑intensive surgical interventions. This study presents an Attention‑Gated Recurrent Residual U‑Net (R2U‑Net) based Triplanar (2.5D) model for improved brain tumor segmentation. The proposed model enhances feature representation and segmentation accuracy by integrating residual, recurrent, and triplanar architectures while maintaining computational efficiency, potentially aiding in better treatment planning. The proposed method achieves a Dice Similarity Score (DSC) of 0.900 for Whole Tumor (WT) segmentation on the BraTS2021 validation set, demonstrating performance comparable to leading models. Additionally, the triplanar network extracts 64 features per planar model for survival days prediction, which are reduced to 28 using an Artificial Neural Network (ANN). This approach achieves an accuracy of 45.71%, a Mean Squared Error (MSE) of 108,318.128, and a Spearman Rank Correlation Coefficient (SRC) of 0.338 on the test dataset.

Abstract:
Continual learning remains constrained by the need for repeated retraining, high computational costs, and the persistent challenge of forgetting. These factors significantly limit the applicability of continuous learning in real‑world settings, as iterative model updates require significant computational resources and inherently exacerbate forgetting. We present SAILS ‑‑ Segment Anything with Incrementally Learned Semantics, a training‑free framework for Class‑Incremental Semantic Segmentation (CISS) that sidesteps these challenges entirely. SAILS leverages foundational models to decouple CISS into two stages: Zero‑shot region extraction using Segment Anything Model (SAM), followed by semantic association through prototypes in a fixed feature space. SAILS incorporates selective intra‑class clustering, resulting in multiple prototypes per class to better model intra‑class variability. Our results demonstrate that, despite requiring no incremental training, SAILS typically surpasses the performance of existing training‑based approaches on standard CISS datasets, particularly in long and challenging task sequences where forgetting tends to be most severe. By avoiding parameter updates, SAILS completely eliminates forgetting and maintains consistent, task‑invariant performance. Furthermore, SAILS exhibits positive backward transfer, where the introduction of new classes can enhance performance on previous classes.
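
The second stage, semantic association through multiple prototypes per class, can be illustrated with a minimal sketch. It assumes region features have already been extracted in a fixed feature space and uses cosine similarity to the nearest prototype; the similarity measure, the function names, and the dictionary layout are assumptions for illustration, not details given in the abstract.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify_region(feature, prototypes):
    """Label a region by its most similar prototype over all classes.

    prototypes maps class name -> list of prototype vectors.
    """
    best_class, best_sim = None, -math.inf
    for cls, vectors in prototypes.items():
        for vec in vectors:
            sim = cosine(feature, vec)
            if sim > best_sim:
                best_class, best_sim = cls, sim
    return best_class


def add_class(prototypes, cls, vectors):
    """Register a new class; existing prototypes are left untouched."""
    prototypes[cls] = [list(v) for v in vectors]
```

Adding a class only appends entries to the prototype store, so no parameters are updated and the stored prototypes of earlier classes remain unchanged.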

Abstract:
You Only Look Once (YOLO) has been the prominent model for computer vision in deep learning for a decade. This study explores the novel aspects of YOLO26, the most recent version in the YOLO series. The elimination of Distribution Focal Loss (DFL), implementation of End‑to‑End NMS‑Free Inference, introduction of ProgLoss + Small‑Target‑Aware Label Assignment (STAL), and use of the MuSGD optimizer are the primary enhancements designed to improve inference speed, which is claimed to achieve a 43% boost in CPU mode. This is designed to allow YOLO26 to attain real‑time performance on edge devices or those without GPUs. Additionally, YOLO26 offers improvements in many computer vision tasks, including instance segmentation, pose estimation, and oriented bounding box (OBB) decoding. We aim for this effort to provide more value than just consolidating information already included in the existing technical documentation. Therefore, we performed a rigorous architectural investigation into YOLO26, mostly using the source code available in its GitHub repository and its official documentation. The authentic and detailed operational mechanisms of YOLO26 are inside the source code, which is seldom extracted by others. The YOLO26 architectural diagram is shown as the outcome of the investigation. This study is, to our knowledge, the first one presenting the CNN‑based YOLO26 architecture, which is the core of YOLO26. Our objective is to provide a precise architectural comprehension of YOLO26 for researchers and developers aspiring to enhance the YOLO model, ensuring it remains the leading deep learning model in computer vision.

Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in high‑level visual understanding. However, extending these models to fine‑grained dense prediction tasks, such as semantic segmentation and depth estimation, typically necessitates the incorporation of complex, task‑specific decoders and other customizations. This architectural fragmentation increases model complexity and deviates from the generalist design of MLLMs, ultimately limiting their practicality. In this work, we challenge this paradigm by accommodating standard MLLMs to perform dense predictions without requiring additional task‑specific decoders. The proposed model is called DenseMLLM, grounded in the standard architecture with a novel vision token supervision strategy for multiple labels and tasks. Despite its minimalist design, our model achieves highly competitive performance across a wide range of dense prediction and vision‑language benchmarks, demonstrating that a standard, general‑purpose MLLM can effectively support dense perception without architectural specialization.

Abstract:
Cooperative perception systems for autonomous driving aim to overcome the limited perception range of a single vehicle by communicating with adjacent agents to share sensing information. While this improves perception performance, these systems also face a significant privacy‑leakage issue, as sensitive visual content can potentially be reconstructed from the shared data. In this paper, we propose a novel Privacy‑Concealing Cooperation (PCC) framework for Bird's Eye View (BEV) semantic segmentation. Based on commonly shared BEV features, we design a hiding network to prevent an image reconstruction network from recovering the input images from the shared features. An adversarial learning mechanism is employed to train the network, where the hiding network works to conceal the visual clues in the BEV features while the reconstruction network attempts to uncover these clues. To maintain segmentation performance, the perception network is integrated with the hiding network and optimized end‑to‑end. The experimental results demonstrate that the proposed PCC framework effectively degrades the quality of the reconstructed images with minimal impact on segmentation performance, providing privacy protection for cooperating vehicles. The source code will be made publicly available upon publication.

Abstract:
Accurate and efficient modeling of indoor wireless signal propagation is crucial for the deployment of next‑generation Wi‑Fi. This paper presents a digital twin‑based measurement system that integrates real‑world 3D environment reconstruction with deterministic ray tracing for physically grounded electromagnetic modeling. Building geometry is obtained through LiDAR scanning, followed by object segmentation and assignment of ITU‑R standard material parameters. The propagation process is simulated with a GPU‑accelerated ray‑tracing engine that generates path‑level channel attributes, including delay, power, angular dispersion, and Ricean K‑factor. Under identical runtime constraints, the proposed system is evaluated against a commercial measurement simulator, demonstrating up to 21 dB higher path gain and consistently improved signal‑to‑interference‑plus‑noise ratio in line‑of‑sight conditions. Additionally, experiments against onsite RSSI measurements confirm a high spatial correlation of 0.98 after calibration, supporting the system's fidelity in real‑world settings. Furthermore, coverage analysis across the 2.4 GHz, 5 GHz, and 6 GHz bands demonstrates the capability of the system to model frequency‑dependent material attenuation for Wi‑Fi 6E/7 networks. Finally, the system offers interactive 3D visualization and on‑demand data extraction, highlighting its potential for digital twin‑driven wireless system design and optimization.

Abstract:
Macroscopic traffic safety modeling aims to identify critical risk factors for regional crashes, thereby informing targeted policy interventions for safety improvement. However, current approaches rely heavily on static sociodemographic and infrastructure metrics, frequently overlooking the impact of drivers' visual perception of the driving environment. Although visual environment features have been found to impact driving and traffic crashes, existing evidence remains largely observational, failing to establish the robust causality needed for traffic policy evaluation in complex spatial environments. To fill these gaps, we applied semantic segmentation to Google Street View imagery to extract visual environmental features and proposed a Double Machine Learning framework to quantify their causal effects on regional crashes. Meanwhile, we utilized SHAP values to characterize the nonlinear influence mechanisms of confounding variables in the models and applied causal forests to estimate conditional average treatment effects. Leveraging crash records from the Miami metropolitan area, Florida, and 220,000 street view images, evidence shows that greenery proportion exerts a significant and robust negative causal effect on traffic crashes (Average Treatment Effect = ‑6.38, p = 0.005). This protective effect exhibits spatial heterogeneity, being most pronounced in densely populated and socially vulnerable urban cores. While greenery significantly mitigates angle and rear‑end crashes, its protective benefit for vulnerable road users (VRUs) remains limited. Our findings provide causal evidence for greening as a potential safety intervention, prioritizing hazardous visual environments while highlighting the need for distinct design optimizations to protect VRUs.

Abstract:
Cooperative perception enables autonomous agents to share encoded representations over wireless communication to enhance each other's live situational awareness. However, the tension between the limited communication bandwidth and the rich sensor information hinders its practical deployment. Recent studies have explored selection strategies that share only a subset of features per frame while striving to keep the performance on par. Nevertheless, the bandwidth requirement still stresses current wireless technologies. To fundamentally ease the tension, we take a proactive approach, exploiting the temporal continuity to identify features that capture environment dynamics, while avoiding repetitive and redundant transmission of static information. By incorporating temporal awareness, agents are empowered to dynamically adapt the sharing quantity according to environment complexity. We instantiate this intuition into an adaptive selection framework, COOPERTRIM, which introduces a novel conformal temporal uncertainty metric to gauge feature relevance, and a data‑driven mechanism to dynamically determine the sharing quantity. To evaluate COOPERTRIM, we take semantic segmentation and 3D detection as example tasks. Across multiple open‑source cooperative segmentation and detection models, COOPERTRIM achieves up to 80.28% and 72.52% bandwidth reduction respectively while maintaining a comparable accuracy. Relative to other selection strategies, COOPERTRIM also improves IoU by as much as 45.54% with up to 72% less bandwidth. Combined with compression strategies, COOPERTRIM can further reduce bandwidth usage to as low as 1.46% without compromising IoU performance. Qualitative results show COOPERTRIM gracefully adapts to environmental dynamics, localization error, and communication latency, demonstrating flexibility and paving the way for real‑world deployment.

Abstract:
Cloud‑based machine learning is increasingly explored as a preprocessing strategy for next‑generation visual neuroprostheses, where advanced scene understanding may exceed the computational and energy constraints of battery‑powered visual processing units. Offloading computation to remote servers enables the use of state‑of‑the‑art vision models, but also introduces sensitivity to network latency, jitter, and packet loss, which can disrupt the temporal consistency of the delivered neural stimulus. In this work, we examine the feasibility of cloud‑assisted visual preprocessing for artificial vision by framing remote inference as a perceptually constrained systems problem. We present a network‑adaptive cloud‑assisted pipeline in which real‑time round‑trip‑time feedback is used to dynamically modulate image resolution, compression, and transmission rate, explicitly prioritizing temporal continuity under adverse network conditions. PIDNet is used as a fixed real‑time semantic segmentation backbone, allowing us to isolate how network‑adaptive input encoding affects communication delay, inference time, and perceptual fidelity. Results show that adaptive visual encoding substantially reduces end‑to‑end latency during network congestion, with only modest degradation of global scene structure, while boundary precision degrades more sharply. Together, these findings delineate operating regimes in which cloud‑assisted preprocessing may remain viable for future visual neuroprostheses and underscore the importance of network‑aware adaptation for maintaining perceptual stability.
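
As a rough illustration of round-trip-time feedback driving the input encoding, the sketch below smooths RTT samples and maps the smoothed value to an encoding tier. The tier thresholds, the resolution scales, the JPEG quality values, the frame rates, and the function names are all hypothetical; the abstract does not specify the control law actually used.

```python
def smooth_rtt(prev_ms, sample_ms, alpha=0.125):
    """Exponentially weighted moving average of RTT samples (alpha is assumed)."""
    return (1 - alpha) * prev_ms + alpha * sample_ms


# Hypothetical tiers: (max smoothed RTT in ms, resolution scale, JPEG quality, fps)
TIERS = [
    (50.0, 1.0, 90, 30),
    (150.0, 0.5, 70, 20),
    (float("inf"), 0.25, 50, 10),
]


def choose_encoding(rtt_ms):
    """Pick the first tier whose RTT bound covers the smoothed RTT."""
    for max_rtt, scale, quality, fps in TIERS:
        if rtt_ms <= max_rtt:
            return {"scale": scale, "quality": quality, "fps": fps}


rtt = 40.0
for sample in [45.0, 300.0, 320.0, 310.0]:
    rtt = smooth_rtt(rtt, sample)
    print(round(rtt, 1), choose_encoding(rtt))
```

Smoothing before choosing a tier keeps a single delayed packet from triggering an abrupt drop in resolution, which matters when temporal continuity is the priority.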

Abstract:
Mapping individual tree crowns is essential for tasks such as maintaining urban tree inventories and monitoring forest health, which help us understand and care for our environment. However, automatically separating the crowns from each other in aerial imagery is challenging due to factors such as texture and partial overlap between tree crowns. In this study, we present a method to train deep learning models that segment and separate individual trees from RGB and multispectral images, using pseudo‑labels derived from aerial laser scanning (ALS) data. Our study shows that the ALS‑derived pseudo‑labels can be enhanced using a zero‑shot instance segmentation model, Segment Anything Model 2 (SAM 2). Our method offers a way to obtain domain‑specific training annotations for optical image‑based models without any manual annotation cost, leading to segmentation models that outperform available models targeted at general‑domain deployment on the same task.

Abstract:
In vision‑language models (VLMs), misalignment between textual descriptions and visual coordinates often induces hallucinations. This issue becomes particularly severe in dense prediction tasks such as spatial‑temporal video grounding (STVG). Prior approaches typically focus on enhancing visual‑textual alignment or attaching auxiliary decoders. However, these strategies inevitably introduce additional trainable modules, leading to significant annotation costs and computational overhead. In this work, we propose a novel visual prompting paradigm that avoids the difficult problem of aligning coordinates across modalities. Specifically, we reformulate per‑frame coordinate prediction as a compact instance‑level identification problem by assigning each object a unique, temporally consistent ID. These IDs are embedded into the video as visual prompts, providing explicit and interpretable inputs to the VLMs. Furthermore, we introduce STVG‑R1, the first reinforcement learning framework for STVG, which employs a task‑driven reward to jointly optimize temporal accuracy, spatial consistency, and structural format regularization. Extensive experiments on six benchmarks demonstrate the effectiveness of our approach. STVG‑R1 surpasses the baseline Qwen2.5‑VL‑7B by a remarkable margin of 20.9% on m_IoU on the HCSTVG‑v2 benchmark, establishing a new state of the art (SOTA). Surprisingly, STVG‑R1 also exhibits strong zero‑shot generalization to multi‑object referring video object segmentation tasks, achieving a SOTA 47.3% J&F on MeViS.

Abstract:
Semantic segmentation of 3D point clouds is important for many applications, such as autonomous driving. To train semantic segmentation models, labeled point cloud segmentation datasets are essential. However, point cloud labeling is time‑consuming for annotators, as it typically involves tuning the camera viewpoint and selecting points by lasso. To reduce annotators' labeling time, we propose a viewpoint recommendation approach. We adapt Fitts' law to model the time cost of lasso selection in point clouds. Using the modeled time cost, the viewpoint that minimizes the lasso selection time cost is recommended to the annotator. We build a data labeling system for semantic segmentation of 3D point clouds that integrates our viewpoint recommendation approach. The system enables users to navigate to recommended viewpoints for efficient annotation. Through an ablation study, we observed that our approach effectively reduced the data labeling time cost. We also qualitatively compare our approach with previous viewpoint selection approaches on different datasets.
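
For reference, the standard (Shannon) formulation of Fitts' law predicts movement time as T = a + b * log2(D / W + 1), where D is the distance to the target and W its width. The sketch below ranks candidate viewpoints by that predicted time. It uses the generic pointing formulation rather than the paper's lasso-specific adaptation, and the constants `a` and `b`, the candidate fields, and the function names are hypothetical.

```python
import math


def fitts_time(distance, width, a=0.2, b=0.1):
    """Shannon formulation of Fitts' law; a and b are assumed constants."""
    return a + b * math.log2(distance / width + 1)


def recommend_viewpoint(candidates):
    """Return the candidate with the lowest predicted selection time."""
    return min(candidates, key=lambda c: fitts_time(c["distance"], c["width"]))


# Hypothetical candidates: screen-space distance and target width in pixels.
candidates = [
    {"name": "front", "distance": 300.0, "width": 100.0},
    {"name": "top", "distance": 700.0, "width": 100.0},
    {"name": "side", "distance": 300.0, "width": 20.0},
]
best = recommend_viewpoint(candidates)
print(best["name"])
```

A viewpoint in which the target appears larger and closer to the cursor yields a lower index of difficulty, so it is preferred.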

Abstract:
Background: Automated podocyte foot process quantification is vital for kidney research, but the established "Automatic Morphological Analysis of Podocytes" (AMAP) method is hindered by high computational demands, a lack of a user interface, and Linux dependency. We developed AMAP‑APP, a cross‑platform desktop application designed to overcome these barriers. Methods: AMAP‑APP optimizes efficiency by replacing intensive instance segmentation with classic image processing while retaining the original semantic segmentation model. It introduces a refined Region of Interest (ROI) algorithm to improve precision. Validation involved 365 mouse and human images (STED and confocal), benchmarking performance against the original AMAP via Pearson correlation and Two One‑Sided T‑tests (TOST). Results: AMAP‑APP achieved a 147‑fold increase in processing speed on consumer hardware. Morphometric outputs (area, perimeter, circularity, and slit diaphragm density) showed high correlation (r>0.90) and statistical equivalence (TOST P<0.05) to the original method. Additionally, the new ROI algorithm demonstrated superior accuracy compared to the original, showing reduced deviation from manual delineations. Conclusion: AMAP‑APP democratizes deep learning‑based podocyte morphometry. By eliminating the need for high‑performance computing clusters and providing a user‑friendly interface for Windows, macOS, and Linux, it enables widespread adoption in nephrology research and potential clinical diagnostics.

Abstract:
This work proposes MeCSAFNet, a multi‑branch encoder‑decoder architecture for land cover segmentation in multispectral imagery. The model separately processes visible and non‑visible channels through dual ConvNeXt encoders, followed by individual decoders that reconstruct spatial information. A dedicated fusion decoder integrates intermediate features at multiple scales, combining fine spatial cues with high‑level spectral representations. The feature fusion is further enhanced with CBAM attention, and the ASAU activation function contributes to stable and efficient optimization. The model is designed to process different spectral configurations, including a 4‑channel (4c) input combining RGB and NIR bands, as well as a 6‑channel (6c) input incorporating NDVI and NDWI indices. Experiments on the Five‑Billion‑Pixels (FBP) and Potsdam datasets demonstrate significant performance gains. On FBP, MeCSAFNet‑base (6c) surpasses U‑Net (4c) by +19.21%, U‑Net (6c) by +14.72%, SegFormer (4c) by +19.62%, and SegFormer (6c) by +14.74% in mIoU. On Potsdam, MeCSAFNet‑large (4c) improves over DeepLabV3+ (4c) by +6.48%, DeepLabV3+ (6c) by +5.85%, SegFormer (4c) by +9.11%, and SegFormer (6c) by +4.80% in mIoU. The model also achieves consistent gains over several recent state‑of‑the‑art approaches. Moreover, compact variants of MeCSAFNet deliver notable performance with lower training time and reduced inference cost, supporting their deployment in resource‑constrained environments.
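
The 6-channel input described above can be sketched per pixel as follows. NDVI is computed as (NIR - R) / (NIR + R). For NDWI, the sketch assumes the McFeeters definition (G - NIR) / (G + NIR), since the abstract does not say which variant is used. The function names, the channel order, and the small `eps` guard are illustrative choices.

```python
def normalized_difference(a, b, eps=1e-8):
    """(a - b) / (a + b), with eps guarding against a zero denominator."""
    return (a - b) / (a + b + eps)


def to_six_channels(r, g, b, nir):
    """Extend a 4-channel (RGB + NIR) pixel with NDVI and NDWI."""
    ndvi = normalized_difference(nir, r)  # vegetation index
    ndwi = normalized_difference(g, nir)  # water index (McFeeters variant, assumed)
    return [r, g, b, nir, ndvi, ndwi]


# Hypothetical reflectance values for a vegetated pixel.
pixel = to_six_channels(r=0.1, g=0.2, b=0.3, nir=0.5)
print([round(v, 3) for v in pixel])
```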

Abstract:
Deep neural networks, especially transformer‑based architectures, have achieved remarkable success in semantic segmentation for environmental perception. However, existing models process video frames independently, thus failing to leverage temporal consistency, which could significantly improve both accuracy and stability in dynamic scenes. In this work, we propose a Spatio‑Temporal Attention (STA) mechanism that extends transformer attention blocks to incorporate multi‑frame context, enabling robust temporal feature representations for video semantic segmentation. Our approach modifies standard self‑attention to process spatio‑temporal feature sequences while maintaining computational efficiency and requiring minimal changes to existing architectures. STA demonstrates broad applicability across diverse transformer architectures and remains effective across both lightweight and larger‑scale models. A comprehensive evaluation on the Cityscapes and BDD100k datasets shows substantial improvements of 9.20 percentage points in temporal consistency metrics and up to 1.76 percentage points in mean intersection over union compared to single‑frame baselines. These results demonstrate STA as an effective architectural enhancement for video‑based semantic segmentation applications.
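
Mean intersection over union, the accuracy metric reported above, can be computed from flattened label maps as in this minimal sketch. Skipping classes that are absent from both prediction and ground truth is one common convention and is an assumption here; benchmark toolkits typically accumulate intersections and unions over the whole dataset rather than per image.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: four pixels, two classes.
pred = [0, 0, 1, 1]
target = [0, 1, 1, 1]
score = mean_iou(pred, target, num_classes=2)
print(f"mIoU = {100 * score:.2f}%")
```

A gain stated in percentage points is an absolute difference between two such scores, not a relative change.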

Abstract:
Current instance segmentation models achieve high performance on average predictions, but lack principled uncertainty quantification: their outputs are not calibrated, and there is no guarantee that a predicted mask is close to the ground truth. To address this limitation, we introduce a conformal prediction algorithm to generate adaptive confidence sets for instance segmentation. Given an image and a pixel coordinate query, our algorithm generates a confidence set of instance predictions for that pixel, with a provable guarantee for the probability that at least one of the predictions has high Intersection‑Over‑Union (IoU) with the true object instance mask. We apply our algorithm to instance segmentation examples in agricultural field delineation, cell segmentation, and vehicle detection. Empirically, we find that our prediction sets vary in size based on query difficulty and attain the target coverage, outperforming existing baselines such as Learn Then Test, Conformal Risk Control, and morphological dilation‑based methods. We provide versions of the algorithm with asymptotic and finite sample guarantees.
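
The finite-sample guarantee in conformal prediction rests on a calibrated quantile. The sketch below shows generic split conformal prediction, not the paper's algorithm: given n calibration nonconformity scores and a miscoverage level alpha, the threshold is the k-th smallest score with k = ceil((n + 1) * (1 - alpha)). What the score measures for an instance mask is left abstract, and the function and candidate names are hypothetical.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """k-th smallest calibration score, k = ceil((n + 1) * (1 - alpha))."""
    n = len(cal_scores)
    # The small offset guards against floating-point round-up before ceil.
    k = math.ceil((n + 1) * (1 - alpha) - 1e-9)
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(cal_scores)[k - 1]


def prediction_set(candidate_scores, threshold):
    """Keep every candidate whose nonconformity score is within the threshold."""
    return [name for name, s in candidate_scores.items() if s <= threshold]


cal_scores = [0.5, 0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6]
threshold = conformal_threshold(cal_scores, alpha=0.2)
print(threshold)
print(prediction_set({"mask_a": 0.3, "mask_b": 0.95}, threshold))
```

Harder queries tend to leave more candidates under the threshold, which is why set size adapts to difficulty.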

Abstract:
Multimodal Large Language Models (MLLMs) have recently achieved remarkable success in visual‑language understanding, demonstrating superior high‑level semantic alignment within their vision encoders. An important question thus arises: Can these encoders serve as versatile vision backbones, capable of reliably performing classic vision‑centric tasks as well? To address the question, we make the following contributions: (i) we identify that the vision encoders within MLLMs exhibit deficiencies in their dense feature representations, as evidenced by their suboptimal performance on dense prediction tasks (e.g., semantic segmentation, depth estimation); (ii) we propose VersaViT, a well‑rounded vision transformer that instantiates a novel multi‑task framework for collaborative post‑training. This framework facilitates the optimization of the vision backbone via lightweight task heads with multi‑granularity supervision; (iii) extensive experiments across various downstream tasks demonstrate the effectiveness of our method, yielding a versatile vision backbone suited for both language‑mediated reasoning and pixel‑level understanding.

Abstract:
In Domain Generalized Video Semantic Segmentation (DGVSS), a model is trained on a single labeled driving domain and deployed directly on unseen domains, without target labels or test‑time adaptation, while maintaining temporally consistent predictions over video streams. In practice, both domain shift and temporal‑sampling shift break correspondence‑based propagation and fixed‑stride temporal aggregation, causing severe frame‑to‑frame flicker even in label‑stable regions. We propose Time2General, a DGVSS framework built on Stability Queries. Time2General introduces a Spatio‑Temporal Memory Decoder that aggregates multi‑frame context into a clip‑level spatio‑temporal memory and decodes temporally consistent per‑frame masks without explicit correspondence propagation. To further suppress flicker and improve robustness to varying sampling rates, we propose a Masked Temporal Consistency Loss that regularizes temporal prediction discrepancies across different strides, and we randomize training strides to expose the model to diverse temporal gaps. Extensive experiments on multiple driving benchmarks show that Time2General achieves a substantial improvement in cross‑domain accuracy and temporal stability over prior DGSS and VSS baselines while running at up to 18 FPS. Code will be released after the review process.

Abstract:
Object‑level segmentation in dynamic 4D Gaussian scenes remains challenging due to complex motion, occlusions, and ambiguous boundaries. In this paper, we present an efficient learning‑free 4D Gaussian segmentation framework that lifts video segmentation masks to 4D spaces, whose core is a two‑stage iterative boundary refinement, TIBR4D. The first stage is an Iterative Gaussian Instance Tracing (IGIT) at the temporal segment level. It progressively refines Gaussian‑to‑instance probabilities through iterative tracing, and extracts corresponding Gaussian point clouds that better handle occlusions and preserve completeness of object structures compared to existing one‑shot threshold‑based methods. The second stage is a frame‑wise Gaussian Rendering Range Control (RCC) via suppressing highly uncertain Gaussians near object boundaries while retaining their core contributions for more accurate boundaries. Furthermore, a temporal segmentation merging strategy is proposed for IGIT to balance identity consistency and dynamic awareness. Longer segments enforce stronger multi‑frame constraints for stable identities, while shorter segments allow identity changes to be captured promptly. Experiments on HyperNeRF and Neu3D demonstrate that our method produces accurate object Gaussian point clouds with clearer boundaries and higher efficiency compared to SOTA methods.

Abstract:
Food segmentation models trained on static images have achieved strong performance on benchmark datasets; however, their reliability in video settings remains poorly understood. In real‑world applications such as food monitoring and instance counting, segmentation outputs must be temporally consistent, yet image‑trained models often break down when deployed on videos. In this work, we analyze this failure through an instance segmentation and tracking perspective, focusing on apples as a representative food category. Models are trained solely on image‑level food segmentation data and evaluated on video sequences using an instance segmentation with tracking‑by‑matching framework, enabling object‑level temporal analysis. Our results reveal that high frame‑wise segmentation accuracy does not translate to stable instance identities over time. Temporal appearance variations, particularly illumination changes, specular reflections, and texture ambiguity, lead to mask flickering and identity fragmentation, resulting in significant errors in apple counting. These failures are largely overlooked by conventional image‑based metrics, which substantially overestimate real‑world video performance. Beyond diagnosing the problem, we examine practical remedies that do not require full video supervision, including post‑hoc temporal regularization and self‑supervised temporal consistency objectives. Our findings suggest that the root cause of failure lies in image‑centric training objectives that ignore temporal coherence, rather than model capacity. This study highlights a critical evaluation gap in food segmentation research and motivates temporally‑aware learning and evaluation protocols for video‑based food analysis.
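
The link between mask flicker and counting errors can be seen in a minimal tracking-by-matching sketch. Masks are represented as sets of pixel coordinates, consecutive frames are matched greedily by IoU, and any unmatched mask receives a new identity. The greedy matcher, the 0.5 threshold, and all names are assumptions for illustration rather than the framework used in the study.

```python
def mask_iou(a, b):
    """IoU of two masks given as sets of (row, col) pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def match_frame(tracks, masks, next_id, iou_thresh=0.5):
    """Greedily match current masks to previous tracks (dict: id -> mask)."""
    pairs = sorted(
        ((mask_iou(t_mask, m), tid, j)
         for tid, t_mask in tracks.items()
         for j, m in enumerate(masks)),
        reverse=True,
    )
    assigned, used_ids = {}, set()
    for iou, tid, j in pairs:
        if iou < iou_thresh:
            break  # remaining pairs overlap too little to be the same object
        if tid in used_ids or j in assigned:
            continue
        assigned[j] = tid
        used_ids.add(tid)
    new_tracks = {}
    for j, m in enumerate(masks):
        if j not in assigned:
            assigned[j] = next_id  # unmatched mask starts a new identity
            next_id += 1
        new_tracks[assigned[j]] = m
    return new_tracks, next_id


# One apple tracked as id 1; in the next frame it persists and a second mask appears.
tracks = {1: {(0, 0), (0, 1)}}
masks = [{(0, 0), (0, 1), (0, 2)}, {(5, 5)}]
tracks, next_id = match_frame(tracks, masks, next_id=2)
print(sorted(tracks), next_id)
```

If a mask shrinks or shifts enough between frames that its IoU falls below the threshold, the same apple is given a fresh identity and is counted twice.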

Abstract:
Open‑vocabulary semantic segmentation has become an important direction in remote sensing, as it enables recognition beyond predefined land‑cover categories. However, existing methods mainly depend on passive visual‑text matching and often struggle with semantic ambiguity in geographically complex scenes, especially when different classes exhibit similar spectral or structural patterns. To address this issue, we propose a Geospatial Reasoning Chain‑of‑Thought (GR‑CoT) framework for remote sensing open‑vocabulary semantic segmentation. GR‑CoT consists of an offline knowledge distillation stream and an online instance reasoning stream. The former constructs category interpretation standards for confusing classes, while the latter performs macro‑scenario anchoring, visual feature decoupling, and knowledge‑driven decision synthesis to generate an image‑adaptive vocabulary for downstream segmentation. Experiments on the LoveDA and GID5 benchmarks indicate that the proposed framework improves overall segmentation performance and yields more semantically coherent predictions in complex scenes.

Abstract:
Semantic segmentation in high‑resolution agricultural imagery demands models that strike a careful balance between accuracy and computational efficiency to enable deployment in practical systems. In this work, we propose DAS‑SK, a novel lightweight architecture that retrofits selective kernel convolution (SK‑Conv) into the dual atrous separable convolution (DAS‑Conv) module to strengthen multi‑scale feature learning. The model further enhances the atrous spatial pyramid pooling (ASPP) module, enabling the capture of fine‑grained local structures alongside global contextual information. Built upon a modified DeepLabV3 framework with two complementary backbones ‑ MobileNetV3‑Large and EfficientNet‑B3, the DAS‑SK model mitigates limitations associated with large dataset requirements, limited spectral generalization, and the high computational cost that typically restricts deployment on UAVs and other edge devices. Comprehensive experiments across three benchmarks: LandCover.ai, VDD, and PhenoBench, demonstrate that DAS‑SK consistently achieves state‑of‑the‑art performance, while being more efficient than CNN‑, transformer‑, and hybrid‑based competitors. Notably, DAS‑SK requires up to 21x fewer parameters and 19x fewer GFLOPs than top‑performing transformer models. These findings establish DAS‑SK as a robust, efficient, and scalable solution for real‑time agricultural robotics and high‑resolution remote sensing, with strong potential for broader deployment in other vision domains.

Abstract:
Semantic segmentation and lane detection are crucial tasks in autonomous driving systems. Conventional approaches predominantly rely on deep neural networks (DNNs), which incur high energy costs due to extensive analog‑to‑digital conversions and large‑scale image computations required for low‑latency, real‑time responses. Diffractive optical neural networks (DONNs) have shown promising advantages over conventional DNNs on digital or optoelectronic computing platforms in energy efficiency. By performing all‑optical image processing via light diffraction at the speed of light, DONNs save computation energy costs while reducing the overhead associated with analog‑to‑digital conversions by all‑optical encoding and computing. In this work, we propose a novel all‑optical computing framework for RGB image segmentation and lane detection in autonomous driving applications. Our experimental results demonstrate the effectiveness of the DONN system for image segmentation on the CityScapes dataset. Additionally, we conduct case studies on lane detection using a customized indoor track dataset and simulated driving scenarios in CARLA, where we further evaluate the model's generalizability under diverse environmental conditions.

Abstract:
Robust semantic segmentation of road scenes under adverse illumination, lighting, and shadow conditions remains a core challenge for autonomous driving applications. RGB‑Thermal fusion is a standard approach, yet existing methods apply static fusion strategies uniformly across all conditions, allowing modality‑specific noise to propagate throughout the network. Hence, we propose CLARITY, which dynamically adapts its fusion strategy to the detected scene condition. Guided by vision‑language model (VLM) priors, the network learns to modulate each modality's contribution based on the illumination state while leveraging object embeddings for segmentation, rather than applying a fixed fusion policy. We further introduce two mechanisms: one that preserves valid dark‑object semantics that prior noise‑suppression methods incorrectly discard, and a hierarchical decoder that enforces structural consistency across scales to sharpen boundaries on thin objects. Experiments on the MFNet dataset demonstrate that CLARITY establishes a new state‑of‑the‑art (SOTA), achieving 62.3% mIoU and 77.5% mAcc.

Abstract:
Urban design profoundly impacts public spaces and community engagement. Traditional top‑down methods often overlook public input, creating a gap in design aspirations and reality. Recent advancements in digital tools, like City Information Modelling and augmented reality, have enabled a more participatory process involving more stakeholders in urban design. Further, deep learning and latent diffusion models have lowered barriers for design generation, providing even more opportunities for participatory urban design. Combining state‑of‑the‑art latent diffusion models with interactive semantic segmentation, we propose RECITYGEN, a novel tool that allows users to interactively create variational street view images of urban environments using text prompts. In a pilot project in Beijing, users employed RECITYGEN to suggest improvements for an ongoing Urban Regeneration project. Despite some limitations, RECITYGEN has shown significant potential in aligning with public preferences, indicating a shift towards more dynamic and inclusive urban planning methods. The source code for the project can be found at RECITYGEN GitHub.

Abstract:
Recently, the Segment Anything Model (SAM) has demonstrated strong generalizability in various instance segmentation tasks. However, its performance depends heavily on the quality of manual prompts. In addition, the RGB images that instance segmentation methods normally use inherently lack depth information. As a result, the ability of these methods to perceive spatial structures and delineate object boundaries is hindered. To address these challenges, we propose a Self‑prompted Depth‑Aware SAM (SPDA‑SAM) for instance segmentation. Specifically, we design a Semantic‑Spatial Self‑prompt Module (SSSPM) which extracts the semantic and spatial prompts from the image encoder and the mask decoder of SAM, respectively. Furthermore, we introduce a Coarse‑to‑Fine RGB‑D Fusion Module (C2FFM), in which the features extracted from a monocular RGB image and the depth map estimated from it are fused. In particular, the structural information in the depth map is used to provide coarse‑grained guidance to feature fusion, while local variations in depth are encoded in order to fuse fine‑grained feature representations. To our knowledge, SAM has not been explored in such self‑prompted and depth‑aware manners. Experimental results demonstrate that our SPDA‑SAM outperforms its state‑of‑the‑art counterparts across twelve different data sets. We attribute these results to the guidance of the self‑prompts and to the compensation for spatial information loss provided by the coarse‑to‑fine RGB‑D fusion operation.

Abstract:
Distribution shift is a common challenge in medical images obtained from different clinical centers, significantly hindering the deployment of pre‑trained semantic segmentation models in real‑world applications across multiple domains. Continual Test‑Time Adaptation (CTTA) has emerged as a promising approach to address cross‑domain shifts during continually evolving target domains. Most existing CTTA methods rely on incrementally updating model parameters, which inevitably suffer from error accumulation and catastrophic forgetting, especially in long‑term adaptation. Recent prompt‑tuning‑based works have shown potential to mitigate the two issues above by updating only visual prompts. While these approaches have demonstrated promising performance, several limitations remain: (1) a lack of multi‑scale prompt diversity, (2) inadequate incorporation of instance‑specific knowledge, and (3) a risk of privacy leakage. To overcome these limitations, we propose Multi‑scale Global‑Instance Prompt Tuning (MGIPT) to enhance the scale diversity of prompts and capture both global‑ and instance‑level knowledge for robust CTTA. Specifically, MGIPT consists of an Adaptive‑scale Instance Prompt (AIP) and a Multi‑scale Global‑level Prompt (MGP). AIP dynamically learns lightweight and instance‑specific prompts to mitigate error accumulation with an adaptive optimal‑scale selection mechanism. MGP captures domain‑level knowledge across different scales to ensure robust adaptation with anti‑forgetting capabilities. These complementary components are combined through a weighted ensemble approach, enabling effective dual‑level adaptation that integrates both global and local information. Extensive experiments on medical image segmentation benchmarks demonstrate that our MGIPT outperforms state‑of‑the‑art methods, achieving robust adaptation across continually changing target domains.

Abstract:
Open‑vocabulary semantic segmentation (OVSS) extends traditional closed‑set segmentation by enabling pixel‑wise annotation for both seen and unseen categories using arbitrary textual descriptions. While existing methods leverage vision‑language models (VLMs) like CLIP, their reliance on image‑level pretraining often results in imprecise spatial alignment, leading to mismatched segmentations in ambiguous or cluttered scenes. Moreover, most existing approaches lack strong object priors and region‑level constraints, which can lead to object hallucination or missed detections, further degrading performance. To address these challenges, we propose LoGoSeg, an efficient single‑stage framework that integrates three key innovations: (i) an object existence prior that dynamically weights relevant categories through global image‑text similarity, effectively reducing hallucinations; (ii) a region‑aware alignment module that establishes precise region‑level visual‑textual correspondences; and (iii) a dual‑stream fusion mechanism that optimally combines local structural information with global semantic context. Unlike prior works, LoGoSeg eliminates the need for external mask proposals, additional backbones, or extra datasets, ensuring efficiency. Extensive experiments on six benchmarks (A‑847, PC‑459, A‑150, PC‑59, PAS‑20, and PAS‑20b) demonstrate its competitive performance and strong generalization in open‑vocabulary settings.

Abstract:
Accurate cell instance segmentation is foundational for digital pathology analysis. Existing methods based on contour detection and distance mapping still face significant challenges in processing complex and dense cellular regions. Graph coloring‑based methods provide a new paradigm for this task, yet the effectiveness of this paradigm in real‑world scenarios with dense overlaps and complex topologies has not been verified. To address this issue, we release a large‑scale dataset, GBC‑FS 2025, which contains highly complex and dense sub‑cellular nuclear arrangements. We conduct the first systematic analysis of the chromatic properties of cell adjacency graphs across four diverse datasets and reveal an important discovery: most real‑world cell graphs are non‑bipartite, with a high prevalence of odd‑length cycles (predominantly triangles). This makes simple 2‑coloring theory insufficient for handling complex tissues, while higher‑chromaticity models would cause representational redundancy and optimization difficulties. Building on this observation of complex real‑world contexts, we propose Disco (Densely‑overlapping Cell Instance Segmentation via Adjacency‑aware COllaborative Coloring), an adjacency‑aware framework based on the "divide and conquer" principle. It uniquely combines a data‑driven topological labeling strategy with a constrained deep learning system to resolve complex adjacency conflicts. First, the "Explicit Marking" strategy transforms the topological challenge into a learnable classification task by recursively decomposing the cell graph and isolating a "conflict set." Second, the "Implicit Disambiguation" mechanism resolves ambiguities in conflict regions by enforcing feature dissimilarity between different instances, enabling the model to learn separable feature representations.

Abstract:
Balancing accuracy and latency on high‑resolution images is a critical challenge for lightweight models, particularly for Transformer‑based architectures that often suffer from excessive latency. To address this issue, we introduce ReGLA, a series of lightweight hybrid networks, which integrates efficient convolutions for local feature extraction with ReLU‑based gated linear attention for global modeling. The design incorporates three key innovations: the Efficient Large Receptive Field (ELRF) module for enhancing convolutional efficiency while preserving a large receptive field; the ReLU Gated Modulated Attention (RGMA) module for maintaining linear complexity while enhancing local feature representation; and a multi‑teacher distillation strategy to boost performance on downstream tasks. Extensive experiments validate the superiority of ReGLA; particularly the ReGLA‑M achieves 80.85% Top‑1 accuracy on ImageNet‑1K at 224px, with only 4.98 ms latency at 512px. Furthermore, ReGLA outperforms similarly scaled iFormer models in downstream tasks, achieving gains of 3.1% AP on COCO object detection and 3.6% mIoU on ADE20K semantic segmentation, establishing it as a state‑of‑the‑art solution for high‑resolution visual applications.

Abstract:
Deep neural networks for visual perception are highly susceptible to domain shift, which poses a critical challenge for real‑world deployment under conditions that differ from the training data. To address this domain generalization challenge, we propose a cross‑modal framework under the learning using privileged information (LUPI) paradigm for training a robust, single‑modality RGB model. We leverage event cameras as a source of privileged information, available only during training. The two modalities exhibit complementary characteristics: the RGB stream is semantically dense but domain‑dependent, whereas the event stream is sparse yet more domain‑invariant. Direct feature alignment between them is therefore suboptimal, as it forces the RGB encoder to mimic the sparse event representation, thereby losing semantic detail. To overcome this, we introduce Privileged Event‑based Predictive Regularization (PEPR), which reframes LUPI as a predictive problem in a shared latent space. Instead of enforcing direct cross‑modal alignment, we train the RGB encoder with PEPR to predict event‑based latent features, distilling robustness without sacrificing semantic richness. The resulting standalone RGB model consistently improves robustness to day‑to‑night and other domain shifts, outperforming alignment‑based baselines across object detection and semantic segmentation.

Abstract:
Atomic Force Microscopy (AFM) enables high‑resolution surface imaging at the nanoscale, yet the output is often degraded by artifacts introduced by environmental noise, scanning imperfections, and tip‑sample interactions. To address this challenge, a lightweight and fully automated framework for artifact detection and restoration in AFM image analysis is presented. The pipeline begins with a classification model that determines whether an AFM image contains artifacts. If necessary, a lightweight semantic segmentation network, custom‑designed and trained on AFM data, is applied to generate precise artifact masks. These masks are adaptively expanded based on their structural orientation and then inpainted using a directional neighbor‑based interpolation strategy to preserve 3D surface continuity. A localized Gaussian smoothing operation is then applied for seamless restoration. The system is integrated into a user‑friendly GUI that supports real‑time parameter adjustments and batch processing. Experimental results demonstrate effective artifact removal while preserving nanoscale structural details, providing a robust, geometry‑aware solution for high‑fidelity AFM data interpretation.

Abstract:
The limited sample size and insufficient diversity of lung nodule CT datasets severely restrict the performance and generalization ability of detection models. Existing methods generate images with insufficient diversity and controllability, suffering from issues such as monotonous texture features and distorted anatomical structures. Therefore, we propose a two‑stage generative adversarial network (TSGAN) to enhance the diversity and spatial controllability of synthetic data by decoupling the morphological structure and texture features of lung nodules. In the first stage, StyleGAN is used to generate semantic segmentation mask images, encoding lung nodules and tissue backgrounds to control the anatomical structure of lung nodule images. The second stage uses the DL‑Pix2Pix model to translate the mask map into CT images, employing local importance attention to capture local features, while utilizing dynamic weight multi‑head window attention to enhance the modeling capability of lung nodule texture and background. Compared to the original dataset, accuracy improved by 4.6% and mAP by 4% on the LUNA16 dataset. Experimental results demonstrate that TSGAN can enhance the quality of synthetic images and the performance of detection models.

Abstract:
Multi‑task problem solving has been shown to improve the accuracy of the individual tasks, which is an important feature for robots, as they have limited resources. However, when the number of labels for each task is not equal, that is, when the data are imbalanced, a problem may arise due to an insufficient number of samples, and labeling is not easy for mobile robots in every environment. We propose a method that can learn tasks even in the absence of ground truth labels for some of the tasks. We also provide a detailed analysis of the proposed method. An interesting finding is related to the interaction of the tasks. We show a methodology to find out which tasks can improve the performance of other tasks. We investigate this by training the teacher network with the task outputs such as depth as inputs. We further provide empirical evidence for training with a small amount of data. We use semantic segmentation and depth estimation tasks on two datasets, NYUDv2 and Cityscapes.

Abstract:
Semantic segmentation of microscopy images is a critical task for high‑throughput materials characterisation, yet its automation is severely constrained by the prohibitive cost, subjectivity, and scarcity of expert‑annotated data. While physics‑based simulations offer a scalable alternative to manual labelling, models trained on such data historically fail to generalise due to a significant domain gap, lacking the complex textures, noise patterns, and imaging artefacts inherent to experimental data. This paper introduces a novel framework for labour‑free segmentation that successfully bridges this simulation‑to‑reality gap. Our pipeline leverages phase‑field simulations to generate an abundant source of microstructural morphologies with perfect, intrinsically‑derived ground‑truth masks. We then employ a Cycle‑Consistent Generative Adversarial Network (CycleGAN) for unpaired image‑to‑image translation, transforming the clean simulations into a large‑scale dataset of high‑fidelity, realistic SEM images. A U‑Net model, trained exclusively on this synthetic data, demonstrated remarkable generalisation when deployed on unseen experimental images, achieving a mean Boundary F1‑Score of 0.90 and an Intersection over Union (IOU) of 0.88. Comprehensive validation using t‑SNE feature‑space projection and Shannon entropy analysis confirms that our synthetic images are statistically and featurally indistinguishable from the real data manifold. By completely decoupling model training from manual annotation, our generative framework transforms a data‑scarce problem into one of data abundance, providing a robust and fully automated solution to accelerate materials discovery and analysis.

Abstract:
This work presents a mapless global navigation approach for outdoor applications. It combines the exploratory capacity of conditional variational autoencoders (CVAEs) to generate trajectories and the semantic segmentation capabilities of a lightweight visual language model (VLM) to select the trajectory to execute. Open‑vocabulary segmentation is used to score and select the generated trajectories based on natural language, and a state‑of‑the‑art local planner executes velocity commands. One of the key features of the proposed approach is its ability to generate a wide variety of trajectories, select among them, and navigate in real time. The approach was validated through real‑world outdoor navigation experiments, achieving superior performance compared to state‑of‑the‑art methods. A video showing an experimental run of the system can be found at https://www.youtube.com/watch?v=i3R5ey5O2yk.

Abstract:
Semantic segmentation networks, which are essential for robotic perception, often suffer from performance degradation when the visual distribution of the deployment environment differs from that of the source dataset on which they were trained. Unsupervised Domain Adaptation (UDA) addresses this challenge by adapting the network to the robot's target environment without external supervision, leveraging the large amounts of data a robot might naturally collect during long‑term operation. In such settings, UDA methods can exploit multi‑view consistency across the environment's map to fine‑tune the model in an unsupervised fashion and mitigate domain shift. However, these approaches remain sensitive to cross‑view instance‑level inconsistencies. In this work, we propose a method that starts from a volumetric 3D map to generate multi‑view consistent pseudo‑labels. We then refine these labels using the zero‑shot instance segmentation capabilities of a foundation model, enforcing instance‑level coherence. The refined annotations serve as supervision for self‑supervised fine‑tuning, enabling the robot to adapt its perception system at deployment time. Experiments on real‑world data demonstrate that our approach consistently improves performance over state‑of‑the‑art UDA baselines based on multi‑view consistency, without requiring any ground‑truth labels in the target domain.

Abstract:
We address the problem of reactive motion planning for quadrotors operating in unknown environments with dynamic obstacles. Our approach leverages a 4‑dimensional spatio‑temporal planner, integrated with vision‑based Safe Flight Corridor (SFC) generation and trajectory optimization. Unlike prior methods that rely on map fusion, our framework is mapless, enabling collision avoidance directly from perception while reducing computational overhead. Dynamic obstacles are detected and tracked using a vision‑based object segmentation and tracking pipeline, allowing robust classification of static versus dynamic elements in the scene. To further enhance robustness, we introduce a backup planning module that reactively avoids dynamic obstacles when no direct path to the goal is available, mitigating the risk of collisions during deadlock situations. We validate our method extensively in both simulation and real‑world hardware experiments, and benchmark it against state‑of‑the‑art approaches, showing significant advantages for reactive UAV navigation in dynamic, unknown environments.

Abstract:
Safe UAV emergency landing requires more than just identifying flat terrain; it demands understanding complex semantic risks (e.g., crowds, temporary structures) invisible to traditional geometric sensors. In this paper, we propose a novel framework leveraging Remote Sensing (RS) imagery and Multimodal Large Language Models (MLLMs) for global context‑aware landing site assessment. Unlike local geometric methods, our approach employs a coarse‑to‑fine pipeline: first, a lightweight semantic segmentation module efficiently pre‑screens candidate areas; second, a vision‑language reasoning agent fuses visual features with Point‑of‑Interest (POI) data to detect subtle hazards. To validate this approach, we construct and release the Emergency Landing Site Selection (ELSS) benchmark. Experiments demonstrate that our framework significantly outperforms geometric baselines in risk identification accuracy. Furthermore, qualitative results confirm its ability to generate human‑like, interpretable justifications, enhancing trust in automated decision‑making. The benchmark dataset is publicly accessible at https://anonymous.4open.science/r/ELSS‑dataset‑43D7.

Abstract:
Sorghum is a globally important cereal grown widely in water‑limited and stress‑prone regions. Its strong drought tolerance makes it a priority crop for climate‑resilient agriculture. Improving water‑use efficiency in sorghum requires precise characterisation of stomatal traits, as stomatal control of gas exchange, transpiration and photosynthesis has a major influence on crop performance. Automated analysis of sorghum stomata is difficult because the stomata are small (often less than 40 μm in length in grasses such as sorghum) and vary in shape across genotypes and leaf surfaces. Automated segmentation contributes to high‑throughput stomatal phenotyping, yet current methods still face challenges related to nested small structures and annotation bottlenecks. In this paper, we propose a semi‑supervised instance segmentation framework tailored for analysis of sorghum stomatal components. We collect and annotate a sorghum leaf imagery dataset containing 11,060 human‑annotated patches, covering the three stomatal components (pore, guard cell and complex area) across multiple genotypes and leaf surfaces. To improve the detection of tiny structures, we split high‑resolution microscopy images into overlapping small patches. We then apply a pseudo‑labelling strategy to unannotated images, producing an additional 56,428 pseudo‑labelled patches. Benchmarking across semantic and instance segmentation models shows substantial performance gains: for semantic models the top mIoU increases from 65.93% to 70.35%, whereas for instance models the top AP rises from 28.30% to 46.10%. These results demonstrate that combining patch‑based preprocessing with semi‑supervised learning significantly improves the segmentation of fine stomatal structures. The proposed framework supports scalable extraction of stomatal traits and facilitates broader adoption of AI‑driven phenotyping in crop science.
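
The patch-splitting step described above can be illustrated with a minimal Python sketch. The helper names `patch_starts` and `patch_boxes`, and the patch size and overlap in the usage note, are assumptions for illustration only; the abstract does not state the values or the implementation the authors used.

```python
from typing import List, Tuple


def patch_starts(length: int, patch: int, stride: int) -> List[int]:
    """Start offsets of patches along one axis, covering the full length."""
    if patch > length:
        raise ValueError("patch must not exceed the image side")
    if stride <= 0:
        raise ValueError("stride must be positive (overlap < patch)")
    starts = list(range(0, length - patch + 1, stride))
    # Add a final patch flush with the border if the stride left a gap.
    if starts[-1] != length - patch:
        starts.append(length - patch)
    return starts


def patch_boxes(height: int, width: int, patch: int,
                overlap: int) -> List[Tuple[int, int, int, int]]:
    """(top, left, bottom, right) boxes of square patches.

    Neighbouring patches overlap by at least `overlap` pixels.
    """
    stride = patch - overlap
    return [
        (top, left, top + patch, left + patch)
        for top in patch_starts(height, patch, stride)
        for left in patch_starts(width, patch, stride)
    ]
```

For example, `patch_boxes(8, 8, 4, 2)` yields a 3 x 3 grid of 4 x 4 patches, so a small structure lying on one patch border still appears whole in a neighbouring patch.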

Abstract:
We argue that existing training‑free segmentation methods rely on an implicit and limiting assumption: that segmentation is a spectral graph partitioning problem over diffusion‑derived affinities. Such approaches, based on global graph partitioning and eigenvector‑based formulations of affinity matrices, suffer from several fundamental drawbacks: they require pre‑selecting the number of clusters, induce boundary oversmoothing due to spectral relaxation, and remain highly sensitive to noisy or multi‑modal affinity distributions. Moreover, many prior works neglect the importance of local neighborhood structure, which plays a crucial role in stabilizing affinity propagation and preserving fine‑grained contours. To address these limitations, we reformulate training‑free segmentation as a stochastic flow equilibrium problem over diffusion‑induced affinity graphs, where segmentation emerges from a stochastic propagation process that integrates global diffusion attention with local neighborhoods extracted from Stable Diffusion, yielding a sparse yet expressive affinity structure. Building on this formulation, we introduce a Markov propagation scheme that performs random‑walk‑based label diffusion with an adaptive pruning strategy that suppresses unreliable transitions while reinforcing confident affinity paths. Experiments across seven widely used semantic segmentation benchmarks demonstrate that our method achieves state‑of‑the‑art zero‑shot performance, producing sharper boundaries, more coherent regions, and significantly more stable masks compared to prior spectral‑clustering‑based approaches.
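
The idea of random-walk label diffusion over a pruned affinity graph can be sketched in a few lines of Python. This is a generic label-propagation toy, not the authors' scheme: the fixed pruning threshold (the abstract describes an adaptive one), the restart weight `alpha`, and the function names are all assumptions made for illustration.

```python
from typing import List


def prune_and_normalize(affinity: List[List[float]],
                        threshold: float) -> List[List[float]]:
    """Drop affinities below `threshold`, then row-normalize into
    random-walk transition probabilities."""
    transition = []
    for i, row in enumerate(affinity):
        kept = [w if w >= threshold else 0.0 for w in row]
        total = sum(kept)
        if total == 0.0:
            # Isolated node: fall back to a self-loop.
            kept[i] = 1.0
            total = 1.0
        transition.append([w / total for w in kept])
    return transition


def diffuse_labels(transition: List[List[float]],
                   seeds: List[List[float]],
                   alpha: float = 0.8,
                   steps: int = 50) -> List[int]:
    """Iterate scores <- alpha * P @ scores + (1 - alpha) * seeds,
    then return the arg-max label per node."""
    n, k = len(transition), len(seeds[0])
    scores = [row[:] for row in seeds]
    for _ in range(steps):
        scores = [
            [
                alpha * sum(transition[i][j] * scores[j][c] for j in range(n))
                + (1.0 - alpha) * seeds[i][c]
                for c in range(k)
            ]
            for i in range(n)
        ]
    return [max(range(k), key=lambda c: row[c]) for row in scores]
```

In this toy setting, pruning removes weak cross-cluster links before propagation, so labels seeded in one cluster do not leak into another.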

Abstract:
Individual tree crown segmentation is an important task in remote sensing for forest biomass estimation and ecological monitoring. However, accurate delineation in dense, overlapping canopies remains a bottleneck. While supervised deep learning methods suffer from high annotation costs and limited generalization, emerging foundation models (e.g., Segment Anything Model) often lack domain knowledge, leading to under‑segmentation in dense clusters. To bridge this gap, we propose FG‑TreeSeg, a training‑free framework for tree crown instance segmentation that transfers flow‑based delineation from biomedical imaging to remote sensing. By modeling tree crowns as star‑convex objects within a topological flow field using Cellpose‑SAM, the FG‑TreeSeg framework forces the separation of touching tree crown instances based on vector convergence. Experiments on the NEON and BAMFOREST datasets and visual inspection demonstrate that our framework generalizes robustly across diverse sensor types and canopy densities, offering a training‑free solution for tree crown instance segmentation and label generation.

Abstract:
This paper presents YOLOE‑26, a unified framework that integrates the deployment‑optimized YOLO26 (or YOLOv26) architecture with the open‑vocabulary learning paradigm of YOLOE for real‑time open‑vocabulary instance segmentation. Building on the NMS‑free, end‑to‑end design of YOLOv26, the proposed approach preserves the hallmark efficiency and determinism of the YOLO family while extending its capabilities beyond closed‑set recognition. YOLOE‑26 employs a convolutional backbone with PAN/FPN‑style multi‑scale feature aggregation, followed by end‑to‑end regression and instance segmentation heads. A key architectural contribution is the replacement of fixed class logits with an object embedding head, which formulates classification as similarity matching against prompt embeddings derived from text descriptions, visual examples, or a built‑in vocabulary. To enable efficient open‑vocabulary reasoning, the framework incorporates Re‑Parameterizable Region‑Text Alignment (RepRTA) for zero‑overhead text prompting, a Semantic‑Activated Visual Prompt Encoder (SAVPE) for example‑guided segmentation, and Lazy Region Prompt Contrast for prompt‑free inference. All prompting modalities operate within a unified object embedding space, allowing seamless switching between text‑prompted, visual‑prompted, and fully autonomous segmentation. Extensive experiments demonstrate consistent scaling behavior and favorable accuracy‑efficiency trade‑offs across model sizes in both prompted and prompt‑free settings. The training strategy leverages large‑scale detection and grounding datasets with multi‑task optimization and remains fully compatible with the Ultralytics ecosystem for training, validation, and deployment. Overall, YOLOE‑26 provides a practical and scalable solution for real‑time open‑vocabulary instance segmentation in dynamic, real‑world environments.
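
Classification as similarity matching against prompt embeddings can be illustrated with a minimal Python sketch. The use of cosine similarity, the two-dimensional toy embeddings, and the function names are assumptions for illustration; the abstract does not specify the similarity function or the embedding dimensions.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def classify_by_prompt(object_embedding: Sequence[float],
                       prompt_embeddings: Dict[str, Sequence[float]]) -> str:
    """Return the prompt name whose embedding is most similar to the object.

    The label set is whatever prompts are supplied at call time, so no
    fixed class-logit layer is needed.
    """
    return max(
        prompt_embeddings,
        key=lambda name: cosine_similarity(object_embedding,
                                           prompt_embeddings[name]),
    )
```

Because the candidate labels are just dictionary keys, extending the vocabulary amounts to adding another prompt embedding rather than retraining a classifier head.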

Abstract:
Open‑vocabulary grounding requires accurate vision‑language alignment under weak supervision, yet existing methods either rely on global sentence embeddings that lack fine‑grained expressiveness or introduce token‑level alignment with explicit supervision or heavy cross‑attention designs. We propose ExpAlign, a theoretically grounded vision‑language alignment framework built on a principled multiple instance learning formulation. ExpAlign introduces an Expectation Alignment Head that performs attention‑based soft MIL pooling over token‑region similarities, enabling implicit token and instance selection without additional annotations. To further stabilize alignment learning, we develop an energy‑based multi‑scale consistency regularization scheme, including a Top‑K multi‑positive contrastive objective and a Geometry‑Aware Consistency Objective derived from a Lagrangian‑constrained free‑energy minimization. Extensive experiments show that ExpAlign consistently improves open‑vocabulary detection and zero‑shot instance segmentation, particularly on long‑tail categories. Most notably, it achieves 36.2 AP_r on the LVIS minival split, outperforming other state‑of‑the‑art methods at comparable model scale, while remaining lightweight and inference‑efficient.
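
Attention-based soft MIL pooling can be sketched as a softmax-weighted average of instance scores. The Python snippet below is a generic illustration under stated assumptions: the temperature parameter and the function name `soft_mil_pool` are hypothetical, and the abstract does not give the exact form of the Expectation Alignment Head.

```python
import math
from typing import Sequence


def soft_mil_pool(similarities: Sequence[float],
                  temperature: float = 1.0) -> float:
    """Pool instance-level similarities into one bag-level score.

    Weights are a softmax over the similarities themselves, so the result
    is the expected similarity under that attention distribution. A small
    temperature approaches max pooling; a large one approaches the mean.
    """
    if temperature <= 0.0:
        raise ValueError("temperature must be positive")
    peak = max(similarities)  # subtract the max for numerical stability
    exps = [math.exp((s - peak) / temperature) for s in similarities]
    total = sum(exps)
    return sum((e / total) * s for e, s in zip(exps, similarities))
```

The pooled score always lies between the mean and the maximum of the inputs, which is what lets a few strongly matching token-region pairs dominate without a hard selection step.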

Abstract:
Dense prediction infers per‑pixel values from a single image and is fundamental to 3D perception and robotics. Although real‑world scenes exhibit strong structure, existing methods treat it as independent pixel‑wise prediction, often resulting in structural inconsistencies. We propose SHED, a novel encoder‑decoder architecture that explicitly enforces a geometric prior by incorporating segmentation into dense prediction. Through bidirectional hierarchical reasoning, segment tokens are hierarchically pooled in the encoder and unpooled in the decoder to reverse the hierarchy. The model is supervised only at the final output, allowing the segment hierarchy to emerge without explicit segmentation supervision. SHED improves depth boundary sharpness and segment coherence, while demonstrating strong cross‑domain generalization from synthetic to real‑world environments. Its hierarchy‑aware decoder better captures global 3D scene layouts, leading to improved semantic segmentation performance. Moreover, SHED enhances 3D reconstruction quality and reveals interpretable part‑level structures that are often missed by conventional pixel‑wise methods.

Abstract:
Continued pretraining is optimized with fixed self‑supervised tasks but selected by downstream performance, creating a coarse feedback loop in which practitioners evaluate checkpoints, change data mixtures or objectives, and restart runs, while individual updates remain blind to target capabilities. We ask whether a small set of verifiable downstream examples can provide step‑level feedback without directly supervising the learner. We introduce V‑pretraining, which decouples a learner trained only with a self‑supervised loss from a lightweight task designer that constructs targets or views for unlabeled batches. Given the current learner and batch, V‑pretraining scores a candidate construction by predicting the first‑order reduction in downstream loss after the induced self‑supervised update. The designer maximizes this value; the learner then applies the update with targets or views detached, so downstream labels never update learner parameters. We instantiate V‑pretraining as adaptive top‑K soft targets for language modeling and learned views or masks for self‑supervised vision. Across both modalities, V‑pretraining improves target capabilities without degrading generalization. Under wall‑clock‑matched continued pretraining, it improves GSM8K Pass@1 for Qwen models using 1,024 GSM8K examples only as feedback, including a +7.4 point single‑run gain for Qwen2.5‑0.5B. In vision, it improves DINOv3 transfer to ADE20K semantic segmentation and NYUv2 depth estimation while preserving ImageNet linear accuracy, suggesting that feedback‑guided task construction can improve target capabilities without collapsing general‑purpose representations.
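
The notion of a first-order predicted reduction in downstream loss can be made concrete with a small Python sketch. It assumes a plain gradient-descent update with learning rate `lr`, for which a first-order Taylor expansion gives a predicted reduction of `lr` times the inner product of the downstream and self-supervised gradients. The function name and the toy quadratic loss in the usage note are illustrative assumptions, not the authors' implementation.

```python
from typing import Sequence


def predicted_reduction(grad_downstream: Sequence[float],
                        grad_ssl: Sequence[float],
                        lr: float) -> float:
    """First-order estimate of how much the downstream loss drops after
    one SGD step along the self-supervised gradient.

    L_down(theta - lr * g_ssl) ~ L_down(theta) - lr * <g_down, g_ssl>,
    so the predicted reduction is lr * <g_down, g_ssl>.
    """
    return lr * sum(a * b for a, b in zip(grad_downstream, grad_ssl))
```

As a sanity check on a toy loss L(theta) = 0.5 * ||theta||^2 with theta = (1, 2), a self-supervised gradient of (1, 0) and `lr = 0.01`, the estimate is 0.01 while the true reduction is 0.00995; the gap is the second-order term. A positive score means the candidate update is predicted to help the downstream task, and a negative one means it is predicted to hurt.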

Abstract:
The Segment Anything Model has revolutionized image segmentation with its zero‑shot capabilities, yet its reliance on manual prompts hinders fully automated deployment. While integrating object detectors as prompt generators offers a pathway to automation, existing pipelines suffer from two fundamental limitations: objective mismatch, where detectors optimized for geometric localization do not correspond to the optimal prompting context required by SAM, and alignment overfitting in standard joint training, where the detector simply memorizes specific prompt adjustments for training samples rather than learning a generalizable policy. To bridge this gap, we introduce BLO‑Inst, a unified framework that aligns detection and segmentation objectives by bi‑level optimization. We formulate the alignment as a nested optimization problem over disjoint data splits. In the lower level, the SAM is fine‑tuned to maximize segmentation fidelity given the current detection proposals on a subset (D_1). In the upper level, the detector is updated to generate bounding boxes that explicitly minimize the validation loss of the fine‑tuned SAM on a separate subset (D_2). This effectively transforms the detector into a segmentation‑aware prompt generator, optimizing the bounding boxes not just for localization accuracy, but for downstream mask quality. Extensive experiments demonstrate that BLO‑Inst achieves superior performance, outperforming standard baselines on tasks in general and biomedical domains.

Abstract:
Deep visual features are increasingly used as the interface in vision systems, motivating the need to describe feature characteristics and control feature quality for machine perception. Just noticeable difference (JND) characterizes the maximum imperceptible distortion for images under human or machine vision. Extending it to deep visual features naturally meets the above demand by providing a task‑aligned tolerance boundary in feature space, offering a practical reference for controlling feature quality under constrained resources. We propose FeatJND, a task‑aligned JND formulation that predicts the maximum tolerable per‑feature perturbation map while preserving downstream task performance. We propose a FeatJND estimator at standardized split points and validate it across image classification, detection, and instance segmentation. Under matched distortion strength, FeatJND‑based distortions consistently preserve higher task performance than unstructured Gaussian perturbations, and attribution visualizations suggest FeatJND can suppress non‑critical feature regions. As an application, we further apply FeatJND to token‑wise dynamic quantization and show that FeatJND‑guided step‑size allocation yields clear gains over random step‑size permutation and global uniform step size under the same noise budget. Our code will be released after publication.

Abstract:
Understanding the three‑dimensional motion of bubbles is essential for interpreting transport and mixing in multiphase flows, especially when bubbles deform under shear or move rapidly through the flow field. In many laboratory setups, only a single high‑speed camera is available, which limits measurements to two dimensions. Traditional image‑processing tools can identify bubbles only when they appear circular and isolated, but they struggle with irregularly shaped bubbles, shear‑induced deformations, strong blurring, and partial overlaps. Multi‑camera systems could overcome these issues, but require significant hardware additions and calibration effort. In this work, we introduce a new machine‑learning framework that can detect bubbles and estimate their depth using only a single 20 kHz high‑speed camera with 3 μm resolution. The method first uses a large unlabeled dataset and clusters the bubbles with an unsupervised algorithm to reveal their underlying structure. These clusters provide pseudo labels, which are combined with a small set of true in‑plane bubble labels to train a semi‑supervised model that generalizes across different bubble appearances. These components produce a continuous depth‑proxy score that indicates how close each bubble is to the imaging plane, even when bubbles are distorted or irregularly shaped. In parallel, we perform robust bubble identification using instance segmentation, which separates touching, overlapping, and elongated bubbles generated by high‑velocity shear. Quantitatively, the in‑plane segmentation baseline achieves strong held‑out performance: an Average Precision (AP) of 0.818, implying stable detection across thresholds and clutter; a bubble detection Precision of 0.901; and a False‑Positive Rate (FPR) near 6.1%, indicating few spurious bubbles and cleaner statistics under the tested acquisition conditions.

Abstract:
Nuclei instance segmentation in hematoxylin and eosin (H&E)‑stained images plays an important role in automated histological image analysis, with various applications in downstream tasks. While several machine learning and deep learning approaches have been proposed for nuclei instance segmentation, most research in this field focuses on developing new segmentation algorithms and benchmarking them on a limited number of arbitrarily selected public datasets. In this work, rather than focusing on model development, we focused on the datasets used for this task. Based on an extensive literature review, we identified manually annotated, publicly available datasets of H&E‑stained images for nuclei instance segmentation and standardized them into a unified input and annotation format. Using two state‑of‑the‑art segmentation models, one based on convolutional neural networks (CNNs) and one based on a hybrid CNN and vision transformer architecture, we systematically evaluated and ranked these datasets based on their nuclei instance segmentation performance. Furthermore, we proposed a unified test set (NucFuse‑test) for fair cross‑dataset evaluation and a unified training set (NucFuse‑train) for improved segmentation performance by merging images from multiple datasets. By evaluating and ranking the datasets, performing comprehensive analyses, generating fused datasets, conducting external validation, and making our implementation publicly available, we provided a new benchmark for training, testing, and evaluating nuclei instance segmentation models on H&E‑stained histological images.

Abstract:
Open‑vocabulary semantic segmentation aims to assign labels to every pixel in an image based on text labels. Existing approaches typically utilize vision‑language models (VLMs), such as CLIP, for dense prediction. However, VLMs, pre‑trained on image‑text pairs, are biased toward salient, object‑centric regions and exhibit two critical limitations when adapted to segmentation: (i) Foreground Bias, which tends to ignore background regions, and (ii) Limited Spatial Localization, resulting in blurred object boundaries. To address these limitations, we introduce DiSa, a novel saliency‑aware foreground‑background disentangled framework. By explicitly incorporating saliency cues in our designed Saliency‑aware Disentanglement Module (SDM), DiSa separately models foreground and background ensemble features in a divide‑and‑conquer manner. Additionally, we propose a Hierarchical Refinement Module (HRM) that leverages pixel‑wise spatial contexts and enables channel‑wise feature refinement through multi‑level updates. Extensive experiments on six benchmarks demonstrate that DiSa consistently outperforms state‑of‑the‑art methods.

Abstract:
In this study, we present a low‑cost and unified framework for vectorized road mapping leveraging enhanced inverse perspective mapping (IPM). In this framework, Catmull‑Rom splines are utilized to characterize lane lines, and all the other ground markings are depicted using polygons uniformly. The results from instance segmentation serve as references to refine the three‑dimensional position of spline control points and polygon corner points. In conjunction with this process, the homography matrix of IPM and vehicle poses are optimized simultaneously. Our proposed framework significantly reduces the mapping errors associated with IPM. It also improves the accuracy of the initial IPM homography matrix and the predicted vehicle poses. Furthermore, it addresses the limitations imposed by the coplanarity assumption in IPM. These enhancements enable IPM to be effectively applied to vectorized road mapping, which serves as a cost‑effective solution with enhanced accuracy. In addition, our framework generalizes road map elements to include all common ground markings and lane lines. The proposed framework is evaluated in two different practical scenarios, and the test results show that our method can automatically generate high‑precision maps with near‑centimeter‑level accuracy. Importantly, the optimized IPM matrix achieves an accuracy comparable to that of manual calibration, while the accuracy of vehicle poses is also significantly improved.
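
To show how a lane line can be represented by a handful of control points, here is a minimal Python sketch of a uniform Catmull-Rom spline. The uniform parameterization, the 2D points, and the function names are assumptions for illustration; the abstract does not state which Catmull-Rom variant the framework uses, and it refines control points in three dimensions.

```python
from typing import List, Sequence, Tuple


def catmull_rom_point(p0: Sequence[float], p1: Sequence[float],
                      p2: Sequence[float], p3: Sequence[float],
                      t: float) -> Tuple[float, ...]:
    """Evaluate a uniform Catmull-Rom segment at t in [0, 1].

    The curve passes through p1 (t = 0) and p2 (t = 1); p0 and p3 only
    shape the tangents at those two points.
    """
    t2, t3 = t * t, t * t * t
    return tuple(
        0.5 * (2.0 * b
               + (-a + c) * t
               + (2.0 * a - 5.0 * b + 4.0 * c - d) * t2
               + (-a + 3.0 * b - 3.0 * c + d) * t3)
        for a, b, c, d in zip(p0, p1, p2, p3)
    )


def sample_lane(control_points: List[Sequence[float]],
                samples_per_segment: int = 10) -> List[Tuple[float, ...]]:
    """Sample a polyline along the spline through the interior control points."""
    if len(control_points) < 4:
        raise ValueError("need at least four control points")
    points = []
    for i in range(len(control_points) - 3):
        p0, p1, p2, p3 = control_points[i:i + 4]
        for k in range(samples_per_segment):
            points.append(catmull_rom_point(p0, p1, p2, p3,
                                            k / samples_per_segment))
    # Close the polyline at the last interpolated control point.
    points.append(tuple(float(v) for v in control_points[-2]))
    return points
```

Because the curve interpolates its interior control points, moving one control point changes the lane geometry only locally, which makes the control points convenient variables to refine against segmentation results.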

Abstract:
Obstructions such as raindrops, fences, or dust degrade captured images, especially when mechanical cleaning is infeasible. Conventional solutions to obstructions rely on a bulky compound optics array or computational inpainting, which compromise compactness or fidelity. Metalenses composed of subwavelength meta‑atoms promise compact imaging, but simultaneous achievement of broadband and obstruction‑free imaging remains a challenge, since a metalens that images distant scenes across a broadband spectrum cannot properly defocus near‑depth occlusions. Here, we introduce a learned split‑spectrum metalens that enables broadband obstruction‑free imaging. Our approach divides the spectrum of each RGB channel into pass and stop bands with multi‑band spectral filtering and learns the metalens to focus light from far objects through pass bands, while filtering focused near‑depth light through stop bands. This optical signal is further enhanced using a neural network. Our learned split‑spectrum metalens achieves broadband and obstruction‑free imaging with relative PSNR gains of 32.29% and improves object detection and semantic segmentation accuracies with absolute gains of +13.54% mAP, +48.45% IoU, and +20.35% mIoU over a conventional hyperbolic design. This promises robust obstruction‑free sensing and vision for space‑constrained systems, such as mobile robots, drones, and endoscopes.

Abstract:
Accurate depth estimation is fundamental to 3D perception in autonomous driving, supporting tasks such as detection, tracking, and motion planning. However, monocular camera‑based 3D detection suffers from depth ambiguity and reduced robustness under challenging conditions. Radar provides complementary advantages such as resilience to poor lighting and adverse weather, but its sparsity and low resolution limit its direct use in detection frameworks. This motivates the need for effective Radar‑camera fusion with improved preprocessing and depth estimation strategies. We propose an end‑to‑end framework that enhances monocular 3D object detection through two key components. First, we introduce InstaRadar, an instance segmentation‑guided expansion method that leverages pre‑trained segmentation masks to enhance Radar density and semantic alignment, producing a more structured representation. InstaRadar achieves state‑of‑the‑art results in Radar‑guided depth estimation, showing its effectiveness in generating high‑quality depth features. Second, we integrate the pre‑trained RCDPT into the BEVDepth framework as a replacement for its depth module. With InstaRadar‑enhanced inputs, the RCDPT integration consistently improves 3D detection performance. Overall, these components yield steady gains over the baseline BEVDepth model, demonstrating the effectiveness of InstaRadar and the advantage of explicit depth supervision in 3D object detection. Although the framework lags behind Radar‑camera fusion models that directly extract BEV features, since Radar serves only as guidance rather than an independent feature stream, this limitation highlights potential for improvement. Future work will extend InstaRadar to point cloud‑like representations and integrate a dedicated Radar branch with temporal cues for enhanced BEV fusion.

Abstract:
Smart glasses enhance interactions with the environment by using head‑mounted cameras to observe the user's viewpoint, but lack the visual feedback used for common interactions. We introduce Gazeify then Voiceify, a multimodal approach allowing object selection via gaze and voice using displayless smart glasses. Users can select a physical object with their gaze, and the system generates a digital mask and a voice description of the object's semantics. Users can further correct errors through free‑form conversation. To demonstrate our approach, we develop an interactive system by integrating advanced object segmentation and detection with a vision‑language model. User studies reveal that participants achieve correct gaze selection in 53% of the task trials and use voice disambiguation to correct 58% of the remaining errors. Participants also rated the system as likable, useful, and easy to use.

Abstract:
Multimodal remote sensing technology significantly enhances the understanding of surface semantics by integrating heterogeneous data such as optical images, Synthetic Aperture Radar (SAR), and Digital Surface Models (DSM). However, in practical applications, missing modality data (e.g., optical or DSM) is a common and severe challenge that degrades the performance of traditional multimodal fusion models. Existing methods for addressing missing modalities still face limitations, including feature collapse and overly generalized recovered features. To address these issues, we propose STARS (Shared‑specific Translation and Alignment for missing‑modality Remote Sensing), a robust semantic segmentation framework for incomplete multimodal inputs. STARS is built on two key designs. First, we introduce an asymmetric alignment mechanism with bidirectional translation and stop‑gradient, which effectively prevents feature collapse and reduces sensitivity to hyperparameters. Second, we propose a Pixel‑level Semantic sampling Alignment (PSA) strategy that combines class‑balanced pixel sampling with a cross‑modality semantic alignment loss to mitigate alignment failures caused by severe class imbalance and improve minority‑class recognition.

Abstract:
We present a novel approach for extracting 3D atomic‑level information from transmission electron microscopy (TEM) images affected by significant noise. The approach is based on formulating depth estimation as a semantic segmentation problem. We address the resulting segmentation problem by training a deep convolutional neural network to generate pixel‑wise depth segmentation maps using simulated data corrupted by synthetic noise. The proposed method was applied to estimate the depth of atomic columns in CeO2 nanoparticles from simulated images and real‑world TEM data. Our experiments show that the resulting depth estimates are accurate, calibrated and robust to noise.
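
The idea of casting depth estimation as semantic segmentation can be illustrated with a minimal sketch. It assumes depth is a continuous value split into uniform bins; the bin range, the number of classes, and all function names are hypothetical, and the paper may instead use another discretization (for example, integer atom counts per column).

```python
# Hypothetical sketch: convert per-pixel depth values into class labels
# (the segmentation targets) and map predicted classes back to depths.
# Uniform binning and all names here are assumptions, not the authors' method.

def depth_to_class(depth, d_min, d_max, num_classes):
    """Map a depth value to an integer class index in [0, num_classes - 1]."""
    if d_max <= d_min:
        raise ValueError("d_max must exceed d_min")
    width = (d_max - d_min) / num_classes
    idx = int((depth - d_min) // width)
    # Clamp so that out-of-range depths fall into the first or last bin.
    return min(max(idx, 0), num_classes - 1)


def class_to_depth(idx, d_min, d_max, num_classes):
    """Return the centre of the depth bin that a class index represents."""
    width = (d_max - d_min) / num_classes
    return d_min + (idx + 0.5) * width


def label_map(depth_map, d_min, d_max, num_classes):
    """Turn a 2D depth map (list of rows) into a 2D map of class labels."""
    return [[depth_to_class(d, d_min, d_max, num_classes) for d in row]
            for row in depth_map]
```

A network trained on such label maps predicts one class per pixel; `class_to_depth` then recovers a depth estimate whose resolution is limited by the bin width.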

Abstract:
Mangroves are critical for climate‑change mitigation, requiring reliable monitoring for effective conservation. While deep learning has emerged as a powerful tool for mangrove detection, its progress is hindered by the limitations of existing datasets. In particular, many resources provide only annual map products without curated single‑date image‑mask pairs, are limited to specific regions rather than offering global coverage, or remain inaccessible to the public. To address these challenges, we introduce MANGO, a large‑scale global dataset comprising 42,703 labeled image‑mask pairs across 124 countries. To construct this dataset, we retrieve all available Sentinel‑2 imagery within the year 2020 for mangrove regions and select the best single‑date observations that align with the mangrove annual mask. This selection is performed using a target detection‑driven approach that leverages pixel‑wise coordinate references to ensure adaptive and representative image‑mask pairings. We also provide a benchmark across diverse semantic segmentation architectures under a country‑disjoint split, establishing a foundation for scalable and reliable global mangrove monitoring.

Abstract:
Deep learning has shown remarkable progress in medical image semantic segmentation, yet its success heavily depends on large‑scale expert annotations and consistent data distributions. In practice, annotations are scarce, and images are collected from multiple scanners or centers, leading to mixed‑domain settings with unknown domain labels and severe domain gaps. Existing semi‑supervised or domain adaptation approaches typically assume either a single domain shift or access to explicit domain indices, which rarely hold in real‑world deployment. In this paper, we propose a domain‑invariant mixed‑domain semi‑supervised segmentation framework that jointly enhances data diversity and mitigates domain bias. A Copy‑Paste Mechanism (CPM) augments the training set by transferring informative regions across domains, while a Cluster Maximum Mean Discrepancy (CMMD) block clusters unlabeled features and aligns them with labeled anchors via an MMD objective, encouraging domain‑invariant representations. Integrated within a teacher‑student framework, our method achieves robust and precise segmentation even with very few labeled examples and multiple unknown domain discrepancies. Experiments on Fundus and M&Ms benchmarks demonstrate that our approach consistently surpasses semi‑supervised and domain adaptation methods, establishing a potential solution for mixed‑domain semi‑supervised medical image segmentation.
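
The MMD objective mentioned above can be sketched as follows. This is a generic biased estimator of squared Maximum Mean Discrepancy with an RBF kernel, not the paper's CMMD block: the clustering step is omitted, and the kernel choice, bandwidth `sigma`, and function names are assumptions.

```python
import math

# Generic sketch of squared MMD between two sets of feature vectors.
# The RBF kernel, sigma, and all names are illustrative assumptions;
# the clustering of unlabeled features described in the abstract is not shown.

def rbf(x, y, sigma):
    """RBF kernel value between two feature vectors of equal length."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mean_kernel(xs, ys, sigma):
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(rbf(x, y, sigma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd2(xs, ys, sigma=1.0):
    """Biased estimate of squared MMD; near zero when xs and ys match."""
    return (mean_kernel(xs, xs, sigma)
            + mean_kernel(ys, ys, sigma)
            - 2.0 * mean_kernel(xs, ys, sigma))
```

Minimizing such a quantity between unlabeled-cluster features and labeled anchors pulls the two distributions together, which is the sense in which an MMD term encourages domain-invariant representations.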

Abstract:
As an important and challenging problem in computer vision, Panoramic Semantic Segmentation (PASS) aims to give complete scene perception based on an ultra‑wide angle of view. Most PASS methods focus on spherical geometry with RGB input or use depth information in its original or HHA format, which does not make full use of panoramic image geometry. To address these shortcomings, we propose REL‑SF4PASS, which combines our REL depth representation based on cylindrical coordinates with Spherical‑dynamic Multi‑Modal Fusion (SMMF). REL is made up of Rectified Depth, Elevation‑Gained Vertical Inclination Angle, and Lateral Orientation Angle, which together fully represent 3D space in cylindrical coordinate style as well as the surface normal direction. SMMF uses different fusion strategies for different regions of a panoramic image, aiming to ensure fusion diversity across regions and to reduce the breakage introduced when the cylinder side surface is unrolled in ERP projection. Experimental results show that REL‑SF4PASS considerably improves performance and robustness on the popular Stanford2D3D Panoramic benchmark. It gains a 2.35% average mIoU improvement across all 3 folds and reduces the performance variance by approximately 70% when facing 3D disturbance.

Abstract:
Discrete video VAEs underpin modern text‑to‑video generation and video understanding systems, yet existing tokenizers typically learn visual codebooks at a single scale with limited vocabularies and shallow language supervision, leading to poor cross‑modal alignment and zero‑shot transfer. We introduce PyraTok, a language‑aligned pyramidal tokenizer that learns semantically structured discrete latents across multiple spatiotemporal resolutions. PyraTok builds on a pretrained video VAE and a novel Language aligned Pyramidal Quantization (LaPQ) module that discretizes encoder features at several depths using a shared large binary codebook, yielding compact yet expressive video token sequences. To tightly couple visual tokens with language, PyraTok jointly optimizes multi‑scale text‑guided quantization and a global autoregressive objective over the token hierarchy. Across ten benchmarks, PyraTok delivers state‑of‑the‑art (SOTA) video reconstruction, consistently improves text‑to‑video quality, and sets new SOTA zero‑shot performance on video segmentation, temporal action localization, and video understanding, scaling robustly to up to 4K/8K resolutions.

Abstract:
Vision‑language pretraining has driven much of the recent progress in medical image representation learning, but this paradigm is constrained by the availability of paired image‑text data and by the reporting bias of clinical narratives. We ask whether competitive radiology encoders can be learned without any language supervision. We introduce RadJEPA, a self‑supervised framework built on a Joint Embedding Predictive Architecture and pretrained on approximately 840K unlabeled chest X‑ray images. The model learns to predict latent representations of masked target regions from a visible context region, an objective that differs from both image‑text contrastive pretraining and DINO‑style self‑distillation by explicitly modelling conditional structure in representation space. We evaluate RadJEPA primarily on radiology report generation with a frozen Vicuna‑7B decoder, and additionally substitute its encoder into four widely used vision‑language backbones (MedLLaVA, Qwen‑2.5, BLIP‑2, and Phi‑4). For completeness we also report disease classification and semantic segmentation results. Across two datasets and four metrics, RadJEPA matches or exceeds the strongest image‑only and vision‑language baselines while using a ViT‑B/14 backbone at 224 x 224 resolution.

Abstract:
This work focuses on national‑scale land‑use/land‑cover (LULC) semantic segmentation using ALOS‑2 single‑polarization (HH) SAR data over Japan, together with a companion binary water detection task. Building on SAR‑W‑MixMAE self‑supervised pretraining [1], we address common SAR dense‑prediction failure modes (boundary over‑smoothing, missed thin/slender structures, and rare‑class degradation under long‑tailed labels) without increasing pipeline complexity. We introduce three lightweight refinements: (i) injecting high‑resolution features into multi‑scale decoding, (ii) a progressive refine‑up head that alternates convolutional refinement and stepwise upsampling, and (iii) an α‑scale factor that tempers class reweighting within a focal+dice objective. The resulting model yields consistent improvements on the Japan‑wide ALOS‑2 LULC benchmark, particularly for under‑represented classes, and improves water detection across standard evaluation metrics.
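
One common way to temper class reweighting is to raise inverse class frequencies to a power α in [0, 1]. The sketch below shows that idea only; the abstract does not state the exact form of the α‑scale factor, so the formula, the normalization to mean 1, and the function name are all assumptions.

```python
# Hypothetical sketch of alpha-tempered class weights for a long-tailed
# label distribution. The power-law form and the mean-1 normalization are
# assumptions; the abstract does not specify how the alpha-scale is applied.

def tempered_class_weights(pixel_counts, alpha):
    """Return per-class loss weights from per-class pixel counts.

    alpha = 0 gives uniform weights; alpha = 1 gives full inverse-frequency
    weights; values in between soften the emphasis on rare classes.
    """
    if any(c <= 0 for c in pixel_counts):
        raise ValueError("every class needs a positive pixel count")
    total = sum(pixel_counts)
    raw = [(total / c) ** alpha for c in pixel_counts]
    mean = sum(raw) / len(raw)
    # Normalize so the weights average to 1 and the loss scale stays stable.
    return [r / mean for r in raw]
```

With counts of 900 and 100 pixels, α = 1 yields weights of 0.2 and 1.8, while α = 0.5 yields 0.5 and 1.5, so the rare class is still favored but less aggressively.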

Abstract:
Culverts and sewer pipes are critical components of drainage systems, and their failure can lead to serious risks to public safety and the environment. In this thesis, we explore methods to improve automated defect segmentation in culverts and sewer pipes. Collecting and annotating data in this field is cumbersome and requires domain knowledge. Having a large dataset for structural defect detection is therefore not feasible. Our proposed methods are tested under conditions with limited annotated data to demonstrate applicability to real‑world scenarios. Overall, this thesis proposes three methods to significantly enhance defect segmentation and handle data scarcity. Data scarcity can be addressed either by enhancing the training data or by adjusting a model's architecture. First, we evaluate preprocessing strategies, including traditional data augmentation and dynamic label injection. These techniques significantly improve segmentation performance, increasing both Intersection over Union (IoU) and F1 score. Second, we introduce FORTRESS, a novel architecture that combines depthwise separable convolutions, adaptive Kolmogorov‑Arnold Networks (KAN), and multi‑scale attention mechanisms. FORTRESS achieves state‑of‑the‑art performance on the culvert sewer pipe defect dataset, while significantly reducing the number of trainable parameters, as well as its computational cost. Finally, we investigate few‑shot semantic segmentation and its applicability to defect detection. Few‑shot learning aims to train models with only limited data available. By employing a bidirectional prototypical network with attention mechanisms, the model learns richer feature representations and achieves satisfactory results across evaluation metrics.

Abstract:
Training deep computer vision models requires manual oversight or hyperparameter tuning of the learning rate (LR) schedule. While existing adaptive optimizers schedule the LR automatically, they suffer from computational and memory overhead, incompatibility with regularization, and suboptimal LR choices. In this work, we introduce the ZENITH (Zero‑overhead Evolution using Norm‑Informed Training History) optimizer, which adapts the LR using the temporal evolution of the gradient norm. Image classification experiments spanning 6 CNN architectures and 6 benchmarks demonstrate that ZENITH achieves higher test accuracy in lower wall‑clock time than baselines. It also yielded superior mAP in object detection, keypoint detection, and instance segmentation on MS COCO using the R‑CNN family of models. Furthermore, its compatibility with regularization enables even better generalization.

Abstract:
Existing segmentation models exhibit significant vulnerability to adversarial attacks. To improve robustness, adversarial training incorporates adversarial examples into model training. However, existing attack methods consider only global semantic information and ignore contextual semantic relationships within the samples, limiting the effectiveness of adversarial training. To address this issue, we propose EroSeg‑AT, a vulnerability‑aware adversarial training framework that leverages EroSeg to generate adversarial examples. EroSeg first selects sensitive pixels based on pixel‑level confidence and then progressively propagates perturbations to higher‑confidence pixels, effectively disrupting the semantic consistency of the samples. Experimental results show that, compared to existing methods, our approach significantly improves attack effectiveness and enhances model robustness under adversarial training.

Abstract:
Weakly Supervised Semantic Segmentation (WSSS), which relies only on image‑level labels, has attracted significant attention for its cost‑effectiveness and scalability. Existing methods mainly enhance inter‑class distinctions and employ data augmentation to mitigate semantic ambiguity and reduce spurious activations. However, they often neglect the complex contextual dependencies among image patches, resulting in incomplete local representations and limited segmentation accuracy. To address these issues, we propose the Context Patch Fusion with Class Token Enhancement (CPF‑CTE) framework, which exploits contextual relations among patches to enrich feature representations and improve segmentation. At its core, the Contextual‑Fusion Bidirectional Long Short‑Term Memory (CF‑BiLSTM) module captures spatial dependencies between patches and enables bidirectional information flow, yielding a more comprehensive understanding of spatial correlations. This strengthens feature learning and segmentation robustness. Moreover, we introduce learnable class tokens that dynamically encode and refine class‑specific semantics, enhancing discriminative capability. By effectively integrating spatial and semantic cues, CPF‑CTE produces richer and more accurate representations of image content. Extensive experiments on PASCAL VOC 2012 and MS COCO 2014 validate that CPF‑CTE consistently surpasses prior WSSS methods.

Abstract:
Until open‑world foundation models match the performance of specialized approaches, deep learning systems remain dependent on task‑ and sensor‑specific data availability. To bridge the gap between available datasets and deployment domains, domain adaptation strategies are widely used. In this work, we propose XD‑MAP, a novel approach to transfer sensor‑specific knowledge from an image dataset to LiDAR, an entirely different sensing domain. Our method leverages detections on camera images to create a semantic parametric map. The map elements are modeled to produce pseudo labels in the target domain without any manual annotation effort. Unlike previous domain transfer approaches, our method does not require direct overlap between sensors and enables extending the angular perception range from a front‑view camera to a full 360° view. On our large‑scale road feature dataset, XD‑MAP outperforms single shot baseline approaches by +19.5 mIoU for 2D semantic segmentation, +19.5 PQth for 2D panoptic segmentation, and +32.3 mIoU in 3D semantic segmentation. The results demonstrate the effectiveness of our approach achieving strong performance on LiDAR data without any manual labeling.

Abstract:
This paper presents a novel cross‑modal visuo‑tactile perception framework for the 3D shape reconstruction of deformable linear objects (DLOs), with a specific focus on cables subject to severe visual occlusions. Unlike existing methods relying predominantly on vision, whose performance degrades under varying illumination, background clutter, or partial visibility, the proposed approach integrates foundation‑model‑based visual perception with adaptive tactile exploration. The visual pipeline exploits SAM for instance segmentation and Florence for semantic refinement, followed by skeletonization, endpoint detection, and point‑cloud extraction. Occluded cable segments are autonomously identified and explored with a tactile sensor, which provides local point clouds that are merged with the visual data through Euclidean clustering and topology‑preserving fusion. A B‑spline interpolation driven by endpoint‑guided point sorting yields a smooth and complete reconstruction of the cable shape. Experimental validation using a robotic manipulator equipped with an RGB‑D camera and a tactile pad demonstrates that the proposed framework accurately reconstructs both simple and highly curved single or multiple cable configurations, even when large portions are occluded. These results highlight the potential of foundation‑model‑enhanced cross‑modal perception for advancing robotic manipulation of deformable objects.

Abstract:
The fact that robots are getting deployed more often in dynamic environments, together with the increasing complexity of their software systems, raises the need for self‑adaptive approaches. In these environments, robotic software systems increasingly operate amid uncertainties where (1) symptoms are easy to observe but root causes are ambiguous, or (2) multiple uncertainties appear concurrently. We present SUNSET, a ROS2‑based exemplar that enables rigorous, repeatable evaluation of architecture‑based self‑adaptation in such conditions. It implements a sensor fusion semantic‑segmentation pipeline driven by a trained Machine Learning (ML) model whose input preprocessing can be perturbed to induce realistic performance degradations. The exemplar exposes five observable symptoms, each of which can be caused by different root causes, and supports concurrent uncertainties spanning self‑healing and self‑optimisation. SUNSET includes the segmentation pipeline, a trained ML model, uncertainty‑injection scripts, a baseline controller, and step‑by‑step integration and evaluation documentation to facilitate reproducible studies and fair comparison.

Abstract:
Developing cost‑efficient and reliable perception systems remains a central challenge for automated vehicles. LiDAR and camera‑based systems dominate, yet they present trade‑offs in cost, robustness and performance under adverse conditions. This work introduces a novel framework for learning‑based 3D semantic segmentation using Calyo Pulse, a modular, solid‑state 3D ultrasound sensor system for use in harsh and cluttered environments. A 3D U‑Net architecture is introduced and trained on the spatial ultrasound data for volumetric segmentation. Results demonstrate robust segmentation performance from Calyo Pulse sensors, with potential for further improvement through larger datasets, refined ground truth, and weighted loss functions. Importantly, this study highlights 3D ultrasound sensing as a promising complementary modality for reliable autonomy.

Abstract:
This paper presents GridNet‑HD, a multi‑modal dataset for 3D semantic segmentation of overhead electrical infrastructures, pairing high‑density LiDAR with high‑resolution oblique imagery. The dataset comprises 7,694 images and 2.5 billion points annotated into 11 classes, with predefined splits and mIoU metrics. Unimodal (LiDAR‑only, image‑only) and multi‑modal fusion baselines are provided. On GridNet‑HD, fusion models outperform the best unimodal baseline by +5.55 mIoU, highlighting the complementarity of geometry and appearance. As reviewed in Sec. 2, no public dataset jointly provides high‑density LiDAR and high‑resolution oblique imagery with 3D semantic labels for power‑line assets. The dataset, baselines, and code are available at https://huggingface.co/collections/heig‑vd‑geo/gridnet‑hd.

Abstract:
Self‑supervised pretraining in remote sensing is mostly done using mid‑spatial resolution (MR) image datasets due to their high availability. Given the release of high‑resolution (HR) datasets, we ask how HR datasets can be included in self‑supervised pretraining to enhance MR image representation learning and downstream segmentation performance on MR tasks. We design a spatial affinity component that can be added to existing self‑supervised learning frameworks and that uses HR imagery to learn better representations of MR imagery. We test the spatial affinity component on two self‑supervised learning frameworks and show that it outperforms models pretrained on HR or MR images alone.

Abstract:
Accurate instance‑level segmentation of organelles in electron microscopy (EM) is critical for quantitative analysis of subcellular morphology and inter‑organelle interactions. However, current benchmarks, based on small, curated datasets, fail to capture the inherent heterogeneity and large spatial context of in‑the‑wild EM data, imposing fundamental limitations on current patch‑based methods. To address these limitations, we developed a large‑scale, multi‑source benchmark for multi‑organelle instance segmentation, comprising over 100,000 2D EM images across a variety of cell types and five organelle classes that capture real‑world variability. Dataset annotations were generated by our connectivity‑aware Label Propagation Algorithm (3D LPA) with expert refinement. We further benchmarked several state‑of‑the‑art models, including U‑Net, SAM variants, and Mask2Former. Our results show several limitations: current models struggle to generalize across heterogeneous EM data and perform poorly on organelles with global, distributed morphologies (e.g., Endoplasmic Reticulum). These findings underscore the fundamental mismatch between local‑context models and the challenge of modeling long‑range structural continuity in the presence of real‑world variability. The benchmark dataset and labeling tool will be publicly released soon.

Abstract:
We present DepthCropSeg++, a foundation model for crop segmentation capable of segmenting different crop species in open in‑field environments. Crop segmentation is a fundamental task for modern agriculture, which closely relates to many downstream tasks such as plant phenotyping, density estimation, and weed control. In the era of foundation models, a number of generic large language and vision models have been developed. These models have demonstrated remarkable real‑world generalization due to significant model capacity and large‑scale datasets. However, current crop segmentation models mostly learn from limited data due to expensive pixel‑level labelling cost, often performing well only under specific crop types or controlled environments. In this work, we follow the vein of our previous work DepthCropSeg, an almost unsupervised approach to crop segmentation, to scale up a cross‑species and cross‑scene crop segmentation dataset, with 28,406 images across 30+ species and 15 environmental conditions. We also build upon ViT‑Adapter, a state‑of‑the‑art semantic segmentation architecture, enhance it with dynamic upsampling for improved detail awareness, and train the model with a two‑stage self‑training pipeline. To systematically validate model performance, we conduct comprehensive experiments to verify its effectiveness and generalization capabilities across multiple crop datasets. Results demonstrate that DepthCropSeg++ achieves 93.11% mIoU on a comprehensive testing set, outperforming both supervised baselines and general‑purpose vision foundation models like Segment Anything Model (SAM) by significant margins (+0.36% and +48.57% respectively). The model particularly excels in challenging scenarios including night‑time environments (86.90% mIoU), high‑density canopies (90.09% mIoU), and unseen crop varieties (90.09% mIoU), indicating a new state of the art for crop segmentation.

Abstract:
Enabling intuitive, language‑driven interaction with surgical scenes is a critical step toward intelligent operating rooms and autonomous surgical robotic assistance. However, the task of referring segmentation, localizing surgical instruments based on natural language descriptions, remains underexplored in surgical videos, with existing approaches struggling to generalize due to reliance on static visual cues and predefined instrument names. In this work, we introduce SurgRef, a novel motion‑guided framework that grounds free‑form language expressions in instrument motion, capturing how tools move and interact across time, rather than what they look like. This allows models to understand and segment instruments even under occlusion, ambiguity, or unfamiliar terminology. To train and evaluate SurgRef, we present Ref‑IMotion, a diverse, multi‑institutional video dataset with dense spatiotemporal masks and rich motion‑centric expressions. SurgRef achieves state‑of‑the‑art accuracy and generalization across surgical procedures, setting a new benchmark for robust, language‑driven surgical video segmentation.

Abstract:
Remote sensing video referring object segmentation (RS‑RVOS) is challenged by weak target saliency and severe visual information truncation in dynamic scenes, making it extremely difficult to maintain discriminative target representations during segmentation. Moreover, progress in this field is hindered by the absence of large‑scale dedicated benchmarks, while existing models are often affected by biased initial memory construction that impairs accurate instance localization in complex scenarios, as well as indiscriminate memory accumulation that encodes noise from occlusions or misclassifications, leading to persistent error propagation. This paper advances RS‑RVOS research through dual contributions in data and methodology. First, we construct RS‑RVOS Bench, the first large‑scale benchmark comprising 111 video sequences, about 25,000 frames, and 213,000 temporal referring annotations. Unlike common RVOS benchmarks where many expressions are written with access to the full video context, our dataset adopts a strict causality‑aware annotation strategy in which linguistic references are generated solely from the target state in the initial frame. Second, we propose a memory‑quality‑aware online referring segmentation framework, termed Memory Quality Control with Segment Anything Model (MQC‑SAM). MQC‑SAM introduces a temporal motion consistency module for initial memory calibration, leveraging short‑term motion trajectory priors to correct structural deviations and establish accurate memory anchoring. Furthermore, it incorporates a decoupled attention‑based memory integration mechanism with dynamic quality assessment, selectively updating high‑confidence semantic features while filtering unreliable information, thereby effectively preventing error accumulation and propagation. Extensive experiments on RS‑RVOS Bench demonstrate that MQC‑SAM achieves state‑of‑the‑art performance.

Abstract:
Indoor environments evolve as objects move, appear, or leave the scene. Capturing these dynamics requires maintaining temporally consistent instance identities across intermittently captured 3D scans, even when changes are unobserved. We introduce and formalize the task of temporally sparse 4D indoor semantic instance segmentation (SIS), which jointly segments, identifies, and temporally associates object instances. This setting poses a challenge for existing 3DSIS methods, which require a discrete matching step due to their lack of temporal reasoning, and for 4D LiDAR approaches, which perform poorly due to their reliance on high‑frequency temporal measurements that are uncommon in the longer‑horizon evolution of indoor environments. We propose ReScene4D, a novel method that adapts 3DSIS architectures for 4DSIS without needing dense observations. Our method enables temporal information sharing (using spatiotemporal contrastive loss, masking, and serialization) to adaptively leverage geometric and semantic priors across observations. This shared context enables consistent instance tracking and improves standard 3DSIS performance. To evaluate this task, we define a new metric, t‑mAP, that extends mAP to reward temporal identity consistency. ReScene4D achieves state‑of‑the‑art performance on the 3RScan dataset, establishing a new benchmark for understanding evolving indoor scenes.

Abstract:
Semantic ultra‑high‑resolution (UHR) image segmentation is essential in remote sensing applications such as aerial mapping and environmental monitoring. Transformer‑based models remain challenging in this setting because memory grows quadratically with the number of tokens, limiting either spatial resolution or contextual scope. We introduce CASWiT (Context‑Aware Stage‑Wise Transformer), a dual‑branch Swin‑based architecture that injects low‑resolution contextual information into fine‑grained high‑resolution features through lightweight stage‑wise cross‑attention. To strengthen cross‑scale learning, we also propose a SimMIM‑style pretraining strategy based on masked reconstruction of the high‑resolution image. Extensive experiments on the large‑scale FLAIR‑HUB aerial dataset demonstrate the effectiveness of CASWiT. Under our RGB‑only UHR protocol, CASWiT reaches 66.37% mIoU with a SegFormer decoder, improving over strong RGB baselines while also improving boundary quality. On the URUR benchmark, CASWiT reaches 49.2% mIoU under the official evaluation protocol, and it also transfers effectively to medical UHR segmentation benchmarks. Code and pretrained models are available at https://huggingface.co/collections/heig‑vd‑geo/caswit

Abstract:
Current research workflows for precise video segmentation are often forced into a compromise between labor‑intensive manual curation, costly commercial platforms, and/or privacy‑compromising cloud‑based services. The demand for high‑fidelity video instance segmentation in research is often hindered by the bottleneck of manual annotation and the privacy concerns of cloud‑based tools. We present SAMannot, an open‑source, local framework that integrates the Segment Anything Model 2 (SAM2) into a human‑in‑the‑loop workflow. To address the high resource requirements of foundation models, we modified the SAM2 dependency and implemented a processing layer that minimizes computational overhead and maximizes throughput, ensuring a highly responsive user interface. Key features include persistent instance identity management, an automated "lock‑and‑refine" workflow with barrier frames, and a mask‑skeletonization‑based auto‑prompting mechanism. SAMannot facilitates the generation of research‑ready datasets in YOLO and PNG formats alongside structured interaction logs. Verified through animal behavior tracking use‑cases and subsets of the LVOS and DAVIS benchmark datasets, the tool provides a scalable, private, and cost‑effective alternative to commercial platforms for complex video annotation tasks.

Abstract:
Effective disaster response depends on speed, and oblique aerial video is the primary modality for initial scouting due to its ability to maximize spatial coverage and situational awareness in limited flight time. However, the on‑board processing of high‑resolution oblique streams is severely bottlenecked by the strict Size, Weight, and Power (SWaP) constraints of Unmanned Aerial Vehicles (UAVs). The computational density required to process these wide‑field‑of‑view streams precludes low‑latency inference on standard edge hardware. To address this, we propose Temporal Token Reuse (TTR), an adaptive inference framework capable of accelerating video segmentation on embedded devices. TTR exploits the intrinsic spatiotemporal redundancy of aerial video by formulating image patches as tokens; it utilizes a lightweight similarity metric to dynamically identify static regions and propagate their precomputed deep features, thereby bypassing redundant backbone computations. We validate the framework on standard benchmarks and a newly curated Oblique Floodwater Dataset designed for hydrological monitoring. Experimental results on edge‑grade hardware demonstrate that TTR achieves a 30% reduction in inference latency with negligible degradation in segmentation accuracy (< 0.5% mIoU). These findings confirm that TTR effectively shifts the operational Pareto frontier, enabling high‑fidelity, real‑time oblique video understanding for time‑critical remote sensing missions.
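
The token-reuse idea can be sketched in a few lines. This is an illustrative toy, not the TTR implementation: the mean-absolute-difference metric, the fixed threshold, the flat-list patch layout, and the `backbone` callable are all assumptions.

```python
# Toy sketch of reusing cached per-patch features for static regions.
# The similarity metric, threshold, and names are illustrative assumptions;
# the abstract only states that a lightweight similarity metric is used.

def patch_difference(prev_patch, curr_patch):
    """Mean absolute pixel difference between two equally sized patches."""
    diffs = (abs(a - b) for a, b in zip(prev_patch, curr_patch))
    return sum(diffs) / len(prev_patch)


def reuse_tokens(prev_patches, curr_patches, cached_features, backbone, threshold):
    """Return features for the current frame and the number of reused patches.

    Patches whose difference to the previous frame is below `threshold` are
    treated as static and keep their cached features; all other patches are
    passed through `backbone` again.
    """
    features, reused = [], 0
    for prev, curr, cached in zip(prev_patches, curr_patches, cached_features):
        if patch_difference(prev, curr) < threshold:
            features.append(cached)
            reused += 1
        else:
            features.append(backbone(curr))
    return features, reused
```

The latency saving in such a scheme scales with the fraction of patches that are reused, while the threshold controls how much drift in the cached features is tolerated.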

Abstract:
Graph‑based methods have proven to be effective in capturing relationships among points for 3D point cloud analysis. However, these methods often suffer from suboptimal graph structures, particularly due to sparse connections at boundary points and noisy connections in junction areas. To address these challenges, we propose a novel method that integrates a graph smoothing module with an enhanced local geometry learning module. Specifically, we identify the limitations of conventional graph structures, particularly in handling boundary points and junction areas. In response, we introduce a graph smoothing module designed to optimize the graph structure and minimize the negative impact of unreliable sparse and noisy connections. Based on the optimized graph structure, we improve the feature extraction function with local geometry information. This information includes shape features derived from adaptive geometric descriptors based on eigenvectors and distribution features obtained through cylindrical coordinate transformation. Experimental results on real‑world datasets validate the effectiveness of our method in various point cloud learning tasks, namely classification, part segmentation, and semantic segmentation.

Abstract:
Tree canopy detection from aerial imagery is an important task for environmental monitoring, urban planning, and ecosystem analysis. Simulating real‑life data annotation scarcity, the Solafune Tree Canopy Detection competition provides a small and imbalanced dataset of only 150 annotated images, posing significant challenges for training deep models without severe overfitting. In this work, we evaluate five representative architectures, YOLOv11, Mask R‑CNN, DeepLabv3, Swin‑UNet, and DINOv2, to assess their suitability for canopy segmentation under extreme data scarcity. Our experiments show that pretrained convolution‑based models, particularly YOLOv11 and Mask R‑CNN, generalize significantly better than pretrained transformer‑based models. DeepLabv3, Swin‑UNet, and DINOv2 underperform, likely due to differences between semantic and instance segmentation tasks, the high data requirements of Vision Transformers, and the lack of strong inductive biases. These findings confirm that transformer‑based architectures struggle in low‑data regimes without substantial pretraining or augmentation and that differences between semantic and instance segmentation further affect model performance. We provide a detailed analysis of training strategies, augmentation policies, and model behavior under the small‑data constraint and demonstrate that lightweight CNN‑based methods remain the most reliable for canopy detection on limited imagery.

Abstract:
Few‑shot semantic segmentation of time‑series remote sensing images remains a critical challenge, particularly in regions where labeled data is scarce or costly to obtain. While state‑of‑the‑art models perform well under full supervision, their performance degrades significantly under limited labeling, limiting their real‑world applicability. In this work, we propose SAM‑Aug, a new annotation‑efficient framework that leverages the geometry‑aware segmentation capability of the Segment Anything Model (SAM) to improve few‑shot land cover mapping. Our approach constructs cloud‑free composite images from temporal sequences and applies SAM in a fully unsupervised manner to generate geometry‑aware mask priors. These priors are then integrated into training through a proposed loss function called RegionSmoothLoss, which enforces prediction consistency within each SAM‑derived region across temporal frames, effectively regularizing the model to respect semantically coherent structures. Extensive experiments on the PASTIS‑R benchmark under a 5 percent labeled setting demonstrate the effectiveness and robustness of SAM‑Aug. Averaged over three random seeds (42, 2025, 4090), our method achieves a mean test mIoU of 36.21 percent, outperforming the state‑of‑the‑art baseline by +2.33 percentage points, a relative improvement of 6.89 percent. Notably, on the most favorable split (seed=42), SAM‑Aug reaches a test mIoU of 40.28 percent, representing an 11.2 percent relative gain with no additional labeled data. The consistent improvement across all seeds confirms the generalization power of leveraging foundation model priors under annotation scarcity. Our results highlight that vision models like SAM can serve as useful regularizers in few‑shot remote sensing learning, offering a scalable and plug‑and‑play solution for land cover monitoring without requiring manual annotations or model fine‑tuning.
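A region-consistency regularizer of the kind described above can be sketched as follows. This is only one plausible reading of RegionSmoothLoss, not the paper's definition: it assumes the loss is the mean squared deviation of per-pixel class probabilities from their region mean, and the function name and the convention that negative region ids are ignored are illustrative.

```python
import numpy as np

def region_smooth_loss(probs, regions):
    """Mean squared deviation of predictions from their region mean.

    probs:   (H, W, C) per-pixel class probabilities.
    regions: (H, W) integer region ids (e.g. from SAM masks); negative = ignored.
    Assumed formulation for illustration only.
    """
    total, count = 0.0, 0
    for rid in np.unique(regions):
        if rid < 0:
            continue
        member = probs[regions == rid]                 # (N, C) pixels of this region
        mean = member.mean(axis=0, keepdims=True)      # region-level prediction
        total += float(((member - mean) ** 2).sum())
        count += member.shape[0]
    return total / max(count, 1)
```

The loss is zero when every pixel inside a region receives the same prediction and grows as predictions within a region diverge, which is the behaviour the abstract attributes to the regularizer.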

Abstract:
Accurate localisation in planetary robotics enables the advanced autonomy required to support the increased scale and scope of future missions. The successes of the Ingenuity helicopter and multiple planetary orbiters lay the groundwork for future missions that use ground‑aerial robotic teams. In this paper, we consider rovers using machine learning to localise themselves in a local aerial map using limited field‑of‑view monocular ground‑view RGB images as input. A key consideration for machine learning methods is that real space data with ground‑truth position labels suitable for training is scarce. In this work, we propose a novel method of localising rovers in an aerial map using cross‑view‑localising dual‑encoder deep neural networks. We leverage semantic segmentation with vision foundation models and high volume synthetic data to bridge the domain gap to real images. We also contribute a new cross‑view dataset of real‑world rover trajectories with corresponding ground‑truth localisation data captured in a planetary analogue facility, plus a high volume dataset of analogous synthetic image pairs. Using particle filters for state estimation with the cross‑view networks allows accurate position estimation over simple and complex trajectories based on sequences of ground‑view images.

Abstract:
Agglomeration refers to the process of crystal clustering due to interparticle forces. Crystal agglomeration analysis from microscopic images is challenging due to the inherent limitations of two‑dimensional imaging. Overlapping crystals may appear connected even when located at different depth layers. Because optical microscopes have a shallow depth of field, crystals that are in‑focus and out‑of‑focus in the same image typically reside on different depth layers and do not constitute true agglomeration. To address this, we first quantify camera focus with an instance camera focus prediction network that predicts a two‑class focus level, which aligns better with visual observations than traditional image processing focus measures. An instance segmentation model is then combined with the predicted focus level for agglomeration classification. Our proposed method achieves higher agglomeration classification and segmentation accuracy than the baseline models on ammonium perchlorate crystal and sugar crystal datasets.

Abstract:
The demand for real‑time visual understanding and interaction in complex scenarios is increasingly critical for unmanned aerial vehicles (UAVs). However, a significant challenge arises from the contradiction between the high computational cost of large vision‑language models and the limited computing resources available on UAV edge devices. To address this challenge, this paper proposes a lightweight multimodal task platform based on BLIP‑2, integrated with YOLO‑World and YOLOv8‑Seg models. This integration extends the multi‑task capabilities of BLIP‑2 for UAV applications with minimal adaptation and without requiring task‑specific fine‑tuning on drone data. First, the deep integration of BLIP‑2 with YOLO models enables it to leverage the precise perceptual results of YOLO for fundamental tasks like object detection and instance segmentation, thereby facilitating deeper visual‑attention understanding and reasoning. Second, a content‑aware key frame sampling mechanism based on K‑Means clustering is designed, which incorporates intelligent frame selection and temporal feature concatenation. This equips the lightweight BLIP‑2 architecture with the capability to handle video‑level interactive tasks effectively. Third, a unified prompt optimization scheme for multi‑task adaptation is implemented. This scheme strategically injects structured event logs from the YOLO models as contextual information into BLIP‑2's input. Combined with output constraints designed to filter out technical details, this approach effectively guides the model to generate accurate and contextually relevant outputs for various tasks.

Abstract:
Vision‑Language Models (VLMs) are increasingly deployed in autonomous driving and embodied AI systems, where reliable perception is critical for safe semantic reasoning and decision‑making. While recent VLMs demonstrate strong performance on multimodal benchmarks, their robustness to realistic perception degradation remains poorly understood. In this work, we systematically study semantic misalignment in VLMs under controlled degradation of upstream visual perception, using semantic segmentation on the Cityscapes dataset as a representative perception module. We introduce perception‑realistic corruptions that induce only moderate drops in conventional segmentation metrics, yet observe severe failures in downstream VLM behavior, including hallucinated object mentions, omission of safety‑critical entities, and inconsistent safety judgments. To quantify these effects, we propose a set of language‑level misalignment metrics that capture hallucination, critical omission, and safety misinterpretation, and analyze their relationship with segmentation quality across multiple contrastive and generative VLMs. Our results reveal a clear disconnect between pixel‑level robustness and multimodal semantic reliability, highlighting a critical limitation of current VLM‑based systems and motivating the need for evaluation frameworks that explicitly account for perception uncertainty in safety‑critical applications.

Abstract:
Lightweight vision networks have witnessed remarkable progress in recent years, yet achieving a satisfactory balance among parameter scale, computational overhead, and task performance remains difficult. Although many existing lightweight models manage to reduce computation considerably, they often do so at the expense of a substantial increase in parameter count (e.g., LSNet, MobileMamba), which still poses obstacles for deployment on resource‑limited devices. In parallel, some studies attempt to draw inspiration from human visual perception, but their modeling tends to oversimplify the visual process, making it hard to reflect how perception truly operates. Revisiting the cooperative mechanism of the human visual system, we propose GPM (Global‑to‑Parallel Multi‑scale Encoding). GPM first employs a Global Insight Generator (GIG) to extract holistic cues, and subsequently processes features of different scales through parallel branches: LSAE emphasizes mid‑/large‑scale semantic relations, while IRB (Inverted Residual Block) preserves fine‑grained texture information, jointly enabling coherent representation of global and local features. As such, GPM conforms to two characteristic behaviors of human vision: perceiving the whole before focusing on details, and maintaining broad contextual awareness even during local attention. Built upon GPM, we further develop the lightweight H‑GPE network. Experiments on image classification, object detection, and semantic segmentation show that H‑GPE achieves strong performance while maintaining a balanced footprint in both FLOPs and parameters, delivering a more favorable accuracy‑efficiency trade‑off compared with recent state‑of‑the‑art lightweight models.

Abstract:
Audio‑visual semantic segmentation (AVSS) represents an extension of the audio‑visual segmentation (AVS) task, necessitating a semantic understanding of audio‑visual scenes beyond merely identifying sound‑emitting objects at the visual pixel level. Contrary to a previous methodology, by decomposing the AVSS task into two discrete subtasks by initially providing a prompted segmentation mask to facilitate subsequent semantic analysis, our approach innovates on this foundational strategy. We introduce a novel collaborative framework, Stepping Stone Plus (SSP), which integrates optical flow and textual prompts to assist the segmentation process. In scenarios where sound sources frequently coexist with moving objects, our pre‑mask technique leverages optical flow to capture motion dynamics, providing essential temporal context for precise segmentation. To address the challenge posed by stationary sound‑emitting objects, such as alarm clocks, SSP incorporates two specific textual prompts: one identifies the category of the sound‑emitting object, and the other provides a broader description of the scene. Additionally, we implement a visual‑textual alignment module (VTA) to facilitate cross‑modal integration, delivering more coherent and contextually relevant semantic interpretations. Our training regimen involves a post‑mask technique aimed at compelling the model to learn the diagram of the optical flow. Experimental results demonstrate that SSP outperforms existing AVS methods, delivering efficient and precise segmentation results.

Abstract:
This paper presents VideoLoom, a unified Video Large Language Model (Video LLM) for joint spatial‑temporal understanding. To facilitate the development of fine‑grained spatial and temporal localization capabilities, we curate LoomData‑8.7k, a human‑centric video dataset with temporally grounded and spatially localized captions. With this, VideoLoom achieves state‑of‑the‑art or highly competitive performance across a variety of spatial and temporal benchmarks (e.g., 63.1 J&F on ReVOS for referring video object segmentation, and 48.3 R1@0.7 on Charades‑STA for temporal grounding). In addition, we introduce LoomBench, a novel benchmark consisting of temporal, spatial, and compositional video‑question pairs, enabling a comprehensive evaluation of Video LLMs from diverse aspects. Collectively, these contributions offer a universal and effective suite for joint spatial‑temporal video understanding, setting a new standard in multimodal intelligence.

Abstract:
Organoids, sophisticated in vitro models of human tissues, are crucial for medical research due to their ability to simulate organ functions and assess drug responses accurately. Accurate organoid instance segmentation is critical for quantifying their dynamic behaviors, yet remains profoundly limited by the scarcity of high‑quality annotated datasets and pervasive overlap in microscopy imaging. While semi‑supervised learning (SSL) offers a solution to alleviate reliance on scarce labeled data, conventional SSL frameworks suffer from biases induced by noisy pseudo‑labels, particularly in overlapping regions. Synthesis‑assisted SSL (SA‑SSL) has been proposed for mitigating training biases in semi‑supervised semantic segmentation. We present the first adaptation of SA‑SSL to organoid instance segmentation and reveal that SA‑SSL struggles to disentangle intertwined organoids, often misrepresenting overlapping instances as a single entity. To overcome this, we propose Pseudo‑Label Unmixing (PLU), which identifies erroneous pseudo‑labels for overlapping instances and then regenerates organoid labels through instance decomposition. For image synthesis, we apply a contour‑based approach to synthesize organoid instances efficiently, particularly for overlapping cases. Instance‑level augmentations (IA) on pseudo‑labels before image synthesis further enhance the effect of synthetic data (SD). Rigorous experiments on two organoid datasets demonstrate our method's effectiveness, achieving performance comparable to fully supervised models using only 10% labeled data, and state‑of‑the‑art results. Ablation studies validate the contributions of PLU, contour‑based synthesis, and augmentation‑aware training. By addressing overlap at both pseudo‑label and synthesis levels, our work advances scalable, label‑efficient organoid analysis, unlocking new potential for high‑throughput applications in precision medicine.

Abstract:
The aim of Active Learning is to select the most informative samples from an unlabelled set of data. This is useful in cases where the amount of data is large and labelling is expensive, such as in machine vision or medical imaging. Two particularities of machine vision are, first, that most of the images produced are free of defects and, second, that the number of images produced is so large that not all acquired images can be stored. This results, on the one hand, in a strong class imbalance in defect distribution and, on the other hand, in a potential label shift caused by limited storage. To understand how these two forms of imbalance affect active learning algorithms, we propose a simulation study based on two open‑source datasets. We artificially create datasets for which we control the levels of class imbalance and label shift. Three standard active learning selection strategies are compared: random sampling, entropy‑based selection, and core‑set selection. We demonstrate that active learning strategies, and in particular the entropy‑based and core‑set selections, remain interesting and efficient even for highly imbalanced datasets. We also illustrate and measure the loss of efficiency that occurs in the presence of a strong label shift.
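Of the three strategies compared above, entropy-based selection is the simplest to illustrate. The sketch below is a generic version of the technique rather than the study's code; the function name `entropy_select` and the clipping constant are illustrative choices.

```python
import numpy as np

def entropy_select(probs, budget):
    """Return indices of the `budget` samples with the highest predictive entropy.

    probs: (N, C) class probabilities predicted for the unlabelled pool.
    """
    p = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)  # avoid log(0)
    entropy = -(p * np.log(p)).sum(axis=1)
    # Stable sort on negated entropy: most uncertain first, ties keep pool order.
    return np.argsort(-entropy, kind="stable")[:budget]
```

Samples whose predicted distribution is closest to uniform are selected first, which is why the strategy tends to favour ambiguous images over the many clearly defect-free ones.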

Abstract:
Current approaches for segmenting ultra‑high‑resolution images either slide a window, thereby discarding global context, or downsample and lose fine detail. We propose a simple yet effective method that brings explicit multi‑scale reasoning to vision transformers, simultaneously preserving local details and global awareness. Concretely, we process each image in parallel at a local scale (high resolution, small crops) and a global scale (low resolution, large crops), and aggregate and propagate features between the two branches with a small set of learnable relay tokens. The design plugs directly into standard transformer backbones (e.g., ViT and Swin) and adds fewer than 2% parameters. Extensive experiments on three ultra‑high‑resolution segmentation benchmarks, Archaeoscape, URUR, and Gleason, and on the conventional Cityscapes dataset show consistent gains, with up to 15% relative mIoU improvement. Code and pretrained models are available at https://archaeoscape.ai/work/relay‑tokens/ .

Abstract:
We present MOSAIC‑GS, a novel, fully explicit, and computationally efficient approach for high‑fidelity dynamic scene reconstruction from monocular videos using Gaussian Splatting. Monocular reconstruction is inherently ill‑posed due to the lack of sufficient multiview constraints, making accurate recovery of object geometry and temporal coherence particularly challenging. To address this, we leverage multiple geometric cues, such as depth, optical flow, dynamic object segmentation, and point tracking. Combined with rigidity‑based motion constraints, these cues allow us to estimate preliminary 3D scene dynamics during an initialization stage. Recovering scene dynamics prior to the photometric optimization reduces reliance on motion inference from visual appearance alone, which is often ambiguous in monocular settings. To enable compact representations, fast training, and real‑time rendering while supporting non‑rigid deformations, the scene is decomposed into static and dynamic components. Each Gaussian in the dynamic part of the scene is assigned a trajectory represented as a time‑dependent Poly‑Fourier curve for parameter‑efficient motion encoding. We demonstrate that MOSAIC‑GS achieves substantially faster optimization and rendering compared to existing methods, while maintaining reconstruction quality on par with state‑of‑the‑art approaches across standard monocular dynamic scene benchmarks.
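The abstract does not give the exact form of the Poly-Fourier curve, so the following sketch assumes a common construction: a base position plus a low-order polynomial in time plus a short Fourier series, with time normalised to [0, 1]. The function name `poly_fourier` and the parameter layout are hypothetical.

```python
import numpy as np

def poly_fourier(t, base, poly_coeffs, cos_coeffs, sin_coeffs):
    """Evaluate an assumed Poly-Fourier trajectory for one Gaussian centre.

    t:           scalar or (T,) array of normalised times in [0, 1].
    base:        (3,) position offset.
    poly_coeffs: (P, 3) coefficients for t**1 .. t**P.
    cos_coeffs, sin_coeffs: (F, 3) coefficients for frequencies 1 .. F.
    Returns a (3,) position for scalar t or a (T, 3) array otherwise.
    """
    t = np.asarray(t, dtype=float)[..., None]          # trailing axis for xyz
    pos = np.asarray(base, dtype=float) + 0.0 * t      # broadcast to (..., 3)
    for k, c in enumerate(np.asarray(poly_coeffs, dtype=float), start=1):
        pos = pos + c * t ** k
    pairs = zip(np.asarray(cos_coeffs, dtype=float), np.asarray(sin_coeffs, dtype=float))
    for k, (a, b) in enumerate(pairs, start=1):
        angle = 2.0 * np.pi * k * t
        pos = pos + a * np.cos(angle) + b * np.sin(angle)
    return pos
```

Under this assumption each trajectory costs only 3 * (1 + P + 2F) parameters regardless of the number of frames, which is the sense in which such a curve is parameter-efficient.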

Abstract:
Unlabeled LiDAR logs in autonomous driving applications are inherently a gold mine of dense 3D geometry hiding in plain sight, yet they are almost useless without human labels, highlighting a dominant cost barrier for autonomous‑perception research. In this work we tackle this bottleneck by leveraging temporal‑geometric consistency across LiDAR sweeps to lift and fuse cues from text and 2D vision foundation models directly into 3D, without any manual input. We introduce an unsupervised multi‑modal pseudo‑labeling method relying on strong geometric priors learned from temporally accumulated LiDAR maps, along with a novel iterative update rule that enforces joint geometric‑semantic consistency and, conversely, detects moving objects from inconsistencies. Our method simultaneously produces 3D semantic labels, 3D bounding boxes, and dense LiDAR scans, demonstrating robust generalization across three datasets. We experimentally validate that our method compares favorably to existing semantic segmentation and object detection pseudo‑labeling methods, which often require additional manual supervision. We confirm that even a small fraction of our geometrically consistent, densified LiDAR improves depth prediction by 51.5% and 22.0% MAE in the 80‑150 and 150‑250 meter ranges, respectively.

Abstract:
Monocular 3D object detection offers a low‑cost alternative to LiDAR, yet remains less accurate due to the difficulty of estimating metric depth from a single image. We systematically evaluate how depth backbones and feature engineering affect a monocular Pseudo‑LiDAR pipeline on the KITTI validation split. Specifically, we compare NeWCRFs (supervised metric depth) against Depth Anything V2 Metric‑Outdoor (Base) under an identical pseudo‑LiDAR generation and PointRCNN detection protocol. NeWCRFs yields stronger downstream 3D detection, achieving 10.50% AP_3D at IoU=0.7 on the Moderate split using grayscale intensity (Exp. 2). We further test point‑cloud augmentations using appearance cues (grayscale intensity) and semantic cues (instance segmentation confidence). Contrary to the expectation that semantics would substantially close the gap, these features provide only marginal gains, and mask‑based sampling can degrade performance by removing contextual geometry. Finally, we report a depth‑accuracy‑versus‑distance diagnostic using ground‑truth 2D boxes (including Ped/Cyc), highlighting that coarse depth correctness does not fully predict strict 3D IoU. Overall, under an off‑the‑shelf LiDAR detector, depth‑backbone choice and geometric fidelity dominate performance, outweighing secondary feature injection.
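The pseudo-LiDAR generation step referred to above is, in its standard form, a pinhole back-projection of the predicted depth map into 3D points. The sketch below shows that generic step with an optional per-pixel intensity channel; the function name `depth_to_points` is illustrative, the output is in the camera frame (x right, y down, z forward), and the conversion to the LiDAR frame with the dataset calibration is deliberately omitted.

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy, intensity=None):
    """Back-project a metric depth map into an (N, 3) or (N, 4) point array.

    depth:     (H, W) depth in metres; non-positive values are treated as invalid.
    fx, fy:    focal lengths in pixels; cx, cy: principal point in pixels.
    intensity: optional (H, W) per-pixel feature appended as a fourth column.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel column / row grids
    z = depth.astype(float)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    cols = [x, y, z]
    if intensity is not None:
        cols.append(np.asarray(intensity, dtype=float))
    pts = np.stack(cols, axis=-1).reshape(-1, len(cols))
    return pts[pts[:, 2] > 0]                        # drop invalid depths
```

Because every 3D coordinate is scaled by the predicted depth, errors in the depth backbone translate directly into misplaced geometry, which is consistent with the abstract's finding that depth quality dominates downstream detection.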

Abstract:
Optics‑guided thermal UAV image super‑resolution has attracted significant research interest due to its potential in all‑weather monitoring applications. However, existing methods typically compress optical features to match thermal feature dimensions for cross‑modal alignment and fusion, which not only causes the loss of high‑frequency information that is beneficial for thermal super‑resolution, but also introduces physically inconsistent artifacts such as texture distortions and edge blurring by overlooking differences in the imaging physics between modalities. To address these challenges, we propose PCNet to achieve cross‑resolution mutual enhancement between optical and thermal modalities, while physically constraining the optical guidance process via thermal conduction to enable robust thermal UAV image super‑resolution. In particular, we design a Cross‑Resolution Mutual Enhancement Module (CRME) to jointly optimize thermal image super‑resolution and optical‑to‑thermal modality conversion, facilitating effective bidirectional feature interaction across resolutions while preserving high‑frequency optical priors. Moreover, we propose a Physics‑Driven Thermal Conduction Module (PDTM) that incorporates two‑dimensional heat conduction into optical guidance, modeling spatially‑varying heat conduction properties to prevent inconsistent artifacts. In addition, we introduce a temperature consistency loss that enforces regional distribution consistency and boundary gradient smoothness to ensure generated thermal images align with real‑world thermal radiation principles. Extensive experiments on VGTSR2.0 and DroneVehicle datasets demonstrate that PCNet significantly outperforms state‑of‑the‑art methods on both reconstruction quality and downstream tasks including semantic segmentation and object detection.

Abstract:
Efficient trajectory planning in off‑road terrains presents a formidable challenge for autonomous vehicles, often necessitating complex multi‑step pipelines. However, traditional approaches exhibit limited adaptability in dynamic environments. To address these limitations, this paper proposes OFF‑EMMA, a novel end‑to‑end multimodal framework designed to overcome the deficiencies of insufficient spatial perception and unstable reasoning in visual‑language‑action (VLA) models for off‑road autonomous driving scenarios. The framework explicitly annotates input images through the design of a visual prompt block and introduces a chain‑of‑thought with self‑consistency (COT‑SC) reasoning strategy to enhance the accuracy and robustness of trajectory planning. The visual prompt block utilizes semantic segmentation masks as visual prompts, enhancing the spatial understanding ability of pre‑trained visual‑language models for complex terrains. The COT‑SC strategy effectively mitigates the error impact of outliers on planning performance through a multi‑path reasoning mechanism. Experimental results on the RELLIS‑3D off‑road dataset demonstrate that OFF‑EMMA significantly outperforms existing methods, reducing the average L2 error of the Qwen backbone model by 13.3% and decreasing the failure rate from 16.52% to 6.56%.

Abstract:
Semantic segmentation on point clouds is critical for 3D scene understanding. However, sparse and irregular point distributions provide limited appearance evidence, making geometry‑only features insufficient to distinguish objects with similar shapes but distinct appearances (e.g., color, texture, material). We propose Gaussian‑to‑Point (G2P), which transfers appearance‑aware attributes from 3D Gaussian Splatting to point clouds for more discriminative and appearance‑consistent segmentation. G2P addresses the misalignment between optimized Gaussians and original point geometry by establishing point‑wise correspondences. By leveraging Gaussian opacity attributes, we resolve the geometric ambiguity that limits existing models. Additionally, Gaussian scale attributes enable precise boundary localization in complex 3D scenes. Extensive experiments demonstrate that our approach achieves superior performance on standard benchmarks and shows significant improvements on geometrically challenging classes, all without any 2D or language supervision.

Abstract:
Earth vision has achieved milestones in geospatial object recognition but lacks exploration in object‑relational reasoning, limiting comprehensive scene understanding. To address this, a progressive Earth vision‑language understanding and generation framework is proposed, including a multi‑task dataset (EarthVLSet) and a semantic‑guided network (EarthVLNet). Focusing on city planning applications, EarthVLSet includes 10.9k sub‑meter resolution remote sensing images, land‑cover masks, and 761.5k textual pairs involving both multiple‑choice and open‑ended visual question answering (VQA) tasks. In an object‑centric way, EarthVLNet is proposed to progressively achieve semantic segmentation, relational reasoning, and comprehensive understanding. The first stage involves land‑cover segmentation to generate object semantics for VQA guidance. Guided by pixel‑wise semantics, the object‑awareness‑based large language model (LLM) performs relational reasoning and knowledge summarization to generate the required answers. As for optimization, the numerical difference loss is proposed to dynamically add difference penalties, addressing the various objects' statistics. Three benchmarks, covering semantic segmentation, multiple‑choice VQA, and open‑ended VQA, demonstrate the superiority of EarthVLNet and point to three future directions: 1) segmentation features consistently enhance VQA performance even in cross‑dataset scenarios; 2) multiple‑choice tasks show greater sensitivity to the vision encoder than to the language decoder; and 3) open‑ended tasks necessitate advanced vision encoders and language decoders for optimal performance. We believe this dataset and method will provide a beneficial benchmark that connects "image‑mask‑text", advancing geographical applications for Earth vision.

Abstract:
Agile locomotion in legged robots poses significant challenges for visual perception. Traditional frame‑based cameras often fail in these scenarios because they produce blurred images, particularly under low‑light conditions. In contrast, event cameras capture changes in brightness asynchronously, offering low latency, high temporal resolution, and high dynamic range. These advantages make them suitable for robust perception during rapid motion and under challenging illumination. However, existing event camera datasets exhibit limitations in stereo configurations and multi‑band sensing domains under various illumination conditions. To address this gap, we present M‑SEVIQ, a multi‑band stereo event visual and inertial quadruped dataset collected using a Unitree Go2 equipped with stereo event cameras, a frame‑based camera, an inertial measurement unit (IMU), and joint encoders. This dataset contains more than 30 real‑world sequences captured across different velocity levels, illumination wavelengths, and lighting conditions. In addition, comprehensive calibration data, including intrinsic, extrinsic, and temporal alignments, are provided to facilitate accurate sensor fusion and benchmarking. M‑SEVIQ can be used to support research in agile robot perception, sensor fusion, semantic segmentation, and multi‑modal vision in challenging environments.

Abstract:
Recent works propose extending 3DGS with semantic feature vectors for simultaneous semantic segmentation and image rendering. However, these methods often treat the semantic and rendering branches separately, relying solely on 2D supervision while ignoring the 3D Gaussian geometry. Moreover, current adaptive strategies adapt the Gaussian set depending solely on rendering gradients, which can be insufficient in subtle or textureless regions. In this work, we propose a joint enhancement framework for 3D semantic Gaussian modeling that synergizes both semantic and rendering branches. Firstly, unlike conventional point cloud shape encoding, we introduce an anisotropic 3D Gaussian Chebyshev descriptor using the Laplace‑Beltrami operator to capture fine‑grained 3D shape details, thereby distinguishing objects with similar appearances and reducing reliance on potentially noisy 2D guidance. In addition, without relying solely on rendering gradient, we adaptively adjust Gaussian allocation and spherical harmonics with local semantic and shape signals, enhancing rendering efficiency through selective resource allocation. Finally, we employ a cross‑scene knowledge transfer module to continuously update learned shape patterns, enabling faster convergence and robust representations without relearning shape information from scratch for each new scene. Experiments on multiple datasets demonstrate improvements in segmentation accuracy and rendering quality while maintaining high rendering frame rates.

Abstract:
This paper presents a novel 3D semantic segmentation method for large‑scale point cloud data that does not require annotated 3D training data or paired RGB images. The proposed approach projects 3D point clouds onto 2D images using virtual cameras and performs semantic segmentation via a foundation 2D model guided by natural language prompts. 3D segmentation is achieved by aggregating predictions from multiple viewpoints through weighted voting. Our method outperforms existing training‑free approaches and achieves segmentation accuracy comparable to supervised methods. Moreover, it supports open‑vocabulary recognition, enabling users to detect objects using arbitrary text queries, thus overcoming the limitations of traditional supervised approaches.
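The multi-view aggregation step described above can be sketched as a weighted vote over per-point observations. The abstract does not say how the weights are computed, so here they are simply taken as given (for example a per-pixel confidence); the function name `weighted_vote` and the use of -1 for points seen by no camera are illustrative choices.

```python
import numpy as np

def weighted_vote(point_ids, labels, weights, num_points, num_classes):
    """Fuse per-view 2D predictions into one label per 3D point.

    point_ids: (M,) index of the 3D point behind each observation.
    labels:    (M,) class predicted for that observation in some view.
    weights:   (M,) vote weight of the observation (assumed given).
    Returns a (num_points,) label array, with -1 for unobserved points.
    """
    point_ids = np.asarray(point_ids, dtype=int)
    labels = np.asarray(labels, dtype=int)
    votes = np.zeros((num_points, num_classes), dtype=float)
    # np.add.at accumulates correctly even when (point, label) pairs repeat.
    np.add.at(votes, (point_ids, labels), np.asarray(weights, dtype=float))
    result = votes.argmax(axis=1)
    result[votes.sum(axis=1) == 0] = -1
    return result
```

A point seen from several virtual cameras thus receives the class with the largest accumulated weight rather than the most frequent one, so a single confident view can outweigh several uncertain ones.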

Abstract:
This research aims to develop a novel deep learning network, GBU‑Net, utilizing a group‑batch‑normalized U‑Net framework, specifically designed for the precise semantic segmentation of the left ventricle in short‑axis cine MRI scans. The methodology includes a down‑sampling pathway for feature extraction and an up‑sampling pathway for detail restoration, enhanced for medical imaging. Key modifications include techniques for better contextual understanding crucial in cardiac MRI segmentation. The dataset consists of 805 left ventricular MRI scans from 45 patients, with comparative analysis using established metrics such as the Dice coefficient and mean perpendicular distance. GBU‑Net significantly improves the accuracy of left ventricle segmentation in cine MRI scans. Its design outperforms existing methods in tests on standard metrics such as the Dice coefficient and mean perpendicular distance. The approach is unique in its ability to capture contextual information, often missed in traditional CNN‑based segmentation. An ensemble of GBU‑Net models attains a 97% Dice score on the SunnyBrook testing dataset. GBU‑Net offers enhanced precision and contextual understanding in left ventricle segmentation for surgical robotics and medical analysis.
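For reference, the Dice coefficient used as the headline metric above can be computed as in the following sketch. It assumes binary masks flattened into 0/1 sequences; returning 1.0 when both masks are empty is a convention chosen here, not something stated in the abstract.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for two flat binary (0/1) masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    # A pixel counts towards the intersection only if both masks mark it.
    intersection = sum(p & t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention (assumption): two empty masks agree perfectly.
    return 1.0 if total == 0 else 2.0 * intersection / total
```

For example, a prediction covering two pixels, one of which matches a one-pixel ground truth, scores 2·1/(2+1) ≈ 0.667.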

Abstract:
In this paper, we revisit multimodal few‑shot 3D point cloud semantic segmentation (FS‑PCS), identifying a conflict in "Fuse‑then‑Refine" paradigms: the "Plasticity‑Stability Dilemma." In addition, CLIP's inter‑class confusion can result in semantic blindness. To address these issues, we present the Decoupled‑experts Arbitration Few‑Shot SegNet (DA‑FSS), a model that effectively distinguishes between semantic and geometric paths and mutually regularizes their gradients to achieve better generalization. DA‑FSS employs the same backbone and pre‑trained text encoder as MM‑FSS to generate text embeddings, which can increase free modalities' utilization rate and better leverage each modality's information space. To achieve this, we propose a Parallel Expert Refinement module to generate each modal correlation. We also propose a Stacked Arbitration Module (SAM) to perform convolutional fusion and arbitrate correlations for each modality pathway. The Parallel Experts decouple two paths: a Geometric Expert maintains plasticity, and a Semantic Expert ensures stability. They are coordinated via a Decoupled Alignment Module (DAM) that transfers knowledge without propagating confusion. Experiments on popular datasets (S3DIS, ScanNet) demonstrate the superiority of DA‑FSS over MM‑FSS. Meanwhile, geometric boundaries, completeness, and texture differentiation are all superior to the baseline. The code is available at: https://github.com/MoWenQAQ/DA‑FSS.

Abstract:
Open‑Set Domain Adaptation for Semantic Segmentation (OSDA‑SS) presents a significant challenge, as it requires both domain adaptation for known classes and the distinction of unknowns. Existing methods attempt to address both tasks within a single unified stage. We question this design, as the annotation imbalance between known and unknown classes often leads to negative transfer of known classes and underfitting for unknowns. To overcome these issues, we propose SATS, a Separating‑then‑Adapting Training Strategy, which addresses OSDA‑SS through two sequential steps: known/unknown separation and unknown‑aware domain adaptation. By providing the model with more accurate and well‑aligned unknown classes, our method ensures balanced learning of discriminative features for both known and unknown classes, steering the model toward discovering truly unknown objects. Additionally, we present hard unknown exploration, an innovative data augmentation method that exposes the model to more challenging unknowns, strengthening its ability to capture a more comprehensive understanding of target unknowns. We evaluate our method on public OSDA‑SS benchmarks. Experimental results demonstrate that our method achieves a substantial advancement, with a +3.85% H‑Score improvement for GTA5‑to‑Cityscapes and +18.64% for SYNTHIA‑to‑Cityscapes, outperforming previous state‑of‑the‑art methods.
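The abstract reports H‑Score without defining it. The sketch below assumes the definition commonly used in open-set benchmarks: the harmonic mean of the mIoU over known classes and the IoU of the unknown class. The function name and arguments are hypothetical.

```python
def h_score(known_miou, unknown_iou):
    """Harmonic mean of known-class mIoU and unknown-class IoU (assumed definition)."""
    if known_miou + unknown_iou == 0:
        # Avoid division by zero when both terms are zero.
        return 0.0
    return 2.0 * known_miou * unknown_iou / (known_miou + unknown_iou)
```

Because it is a harmonic mean, the score stays low unless both terms are reasonably high, which is why it suits a setting where known and unknown classes must be balanced.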

Abstract:
This paper introduces SENA (SEamlessly NAtural), a geometry‑driven image stitching approach that prioritizes structural fidelity in challenging real‑world scenes characterized by parallax and depth variation. Conventional image stitching relies on homographic alignment, but this rigid planar assumption often fails in dual‑camera setups with significant scene depth, leading to distortions such as visible warps and spherical bulging. SENA addresses these fundamental limitations through three key contributions. First, we propose a hierarchical affine‑based warping strategy, combining global affine initialization with local affine refinement and smooth free‑form deformation. This design preserves local shape, parallelism, and aspect ratios, thereby avoiding the hallucinated structural distortions commonly introduced by homography‑based models. Second, we introduce a geometry‑driven adequate zone detection mechanism that identifies parallax‑minimized regions directly from the disparity consistency of RANSAC‑filtered feature correspondences, without relying on semantic segmentation. Third, building upon this adequate zone, we perform anchor‑based seamline cutting and segmentation, enforcing a one‑to‑one geometric correspondence across image pairs by construction, which effectively eliminates ghosting, duplication, and smearing artifacts in the final panorama. Extensive experiments conducted on challenging datasets demonstrate that SENA achieves alignment accuracy comparable to leading homography‑based methods, while significantly outperforming them in critical visual metrics such as shape preservation, texture integrity, and overall visual realism.

Abstract:
Conceal dense prediction (CDP), especially RGB‑D camouflage object detection and open‑vocabulary camouflage object segmentation, plays a crucial role in advancing the understanding and reasoning of complex camouflage scenes. However, high‑quality and large‑scale camouflage datasets with dense annotation remain scarce due to expensive data collection and labeling costs. To address this challenge, we explore leveraging generative models to synthesize realistic camouflage image‑dense data for training CDP models with fine‑grained representations, prior knowledge, and auxiliary reasoning. Concretely, our contributions are threefold: (i) we introduce GenCAMO‑DB, a large‑scale camouflage dataset with multi‑modal annotations, including depth maps, scene graphs, attribute descriptions, and text prompts; (ii) we present GenCAMO, an environment‑aware and mask‑free generative framework that produces high‑fidelity camouflage image‑dense annotations; (iii) extensive experiments across multiple modalities demonstrate that GenCAMO significantly improves dense prediction performance on complex camouflage scenes by providing high‑quality synthetic data. The code and datasets will be released after paper acceptance.

Abstract:
Different types of liquids such as water, wine and medicine appear in all aspects of daily life. However, limited attention has been given to liquid segmentation, hindering the ability of robots to avoid or interact with liquids safely. The segmentation of liquids is difficult because liquids come in diverse appearances and shapes; moreover, they can be either transparent or reflective, taking on the appearance of arbitrary objects and scenes from the background or surroundings. To take on this challenge, we construct a large‑scale dataset of liquids named LQDS consisting of 5000 real‑world images annotated into 14 distinct classes, and design a novel liquid detection model named LQDM, which leverages cross‑attention between a dedicated boundary branch and the main segmentation branch to enhance segmentation predictions. Extensive experiments demonstrate the effectiveness of LQDM on the test set of LQDS, outperforming state‑of‑the‑art methods and establishing a strong baseline for the semantic segmentation of liquids.

Abstract:
In domestic environments, robots require a comprehensive understanding of their surroundings to interact effectively and intuitively with untrained humans. In this paper, we propose DVEFormer, an efficient RGB‑D Transformer‑based approach that predicts dense text‑aligned visual embeddings (DVE) via knowledge distillation. Instead of directly performing classical semantic segmentation with fixed predefined classes, our method uses teacher embeddings from Alpha‑CLIP to guide our efficient student model DVEFormer in learning fine‑grained pixel‑wise embeddings. While this approach still enables classical semantic segmentation, e.g., via linear probing, it further enables flexible text‑based querying and other applications, such as creating comprehensive 3D maps. Evaluations on common indoor datasets demonstrate that our approach achieves competitive performance while meeting real‑time requirements, operating at 26.3 FPS for the full model and 77.0 FPS for a smaller variant on an NVIDIA Jetson AGX Orin. Additionally, we show qualitative results that highlight the effectiveness and possible use cases in real‑world applications. Overall, our method serves as a drop‑in replacement for traditional segmentation approaches while enabling flexible natural‑language querying and seamless integration into 3D mapping pipelines for mobile robotics.
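Text-based querying of dense text-aligned embeddings, as mentioned above, can be sketched as a nearest-neighbor lookup under cosine similarity. This illustrates the general idea only, not the DVEFormer implementation: `query_pixels`, the plain-list embeddings, and the dictionary of text embeddings are all hypothetical, and all vectors are assumed to be non-zero.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def query_pixels(pixel_embeddings, text_embeddings):
    """Assign each pixel the query name whose text embedding is most similar.

    pixel_embeddings: list of per-pixel vectors (hypothetical layout).
    text_embeddings: dict mapping a query name to its embedding vector.
    """
    return [
        max(text_embeddings, key=lambda name: cosine(pixel, text_embeddings[name]))
        for pixel in pixel_embeddings
    ]
```

Since the set of queries is just a dictionary, new categories can be added at inference time without retraining, which is the practical appeal of text-aligned embeddings over a fixed class head.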

Abstract:
Rigorous crop counting is crucial for effective agricultural management and informed intervention strategies. However, in outdoor field environments, partial occlusions combined with inherent ambiguity in distinguishing clustered crops from individual viewpoints pose an immense challenge for image‑based segmentation methods. To address these problems, we introduce a novel crop counting framework designed for exact enumeration via 3D instance segmentation. Our approach utilizes 2D images captured from multiple viewpoints and associates independent instance masks for neural radiance field (NeRF) view synthesis. We introduce crop visibility and mask consistency scores, which are incorporated alongside 3D information from a NeRF model. This results in an effective segmentation of crop instances in 3D and highly accurate crop counts. Furthermore, our method eliminates the dependence on crop‑specific parameter tuning. We validate our framework on three agricultural datasets consisting of cotton bolls, apples, and pears, and demonstrate consistent counting performance despite major variations in crop color, shape, and size. A comparative analysis against the state of the art highlights superior performance on crop counting tasks. Lastly, we contribute a cotton plant dataset to advance further research on this topic.

Abstract:
3D Gaussian Splatting (3DGS) and Neural Radiance Fields (NeRF) have advanced novel‑view synthesis. Recent methods extend multi‑view 2D segmentation to 3D, enabling instance/semantic segmentation for better scene understanding. A key challenge is the inconsistency of 2D instance labels across views, leading to poor 3D predictions. Existing methods use a two‑stage approach in which some rely on contrastive learning with hyperparameter‑sensitive clustering, while others preprocess labels for consistency. We propose a unified framework that merges these steps, reducing training time and improving performance by introducing a learnable feature embedding for segmentation in Gaussian primitives. This embedding is then efficiently decoded into instance labels through a novel "Embedding‑to‑Label" process, effectively integrating the optimization. While this unified framework offers substantial benefits, we observed artifacts at the object boundaries. To address the object boundary issues, we propose hard‑mining samples along these boundaries. However, directly applying hard mining to the feature embeddings proved unstable. Therefore, we apply a linear layer to the rasterized feature embeddings before calculating the triplet loss, which stabilizes training and significantly improves performance. Our method outperforms baselines qualitatively and quantitatively on the ScanNet, Replica3D, and Messy‑Rooms datasets.
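The triplet loss mentioned above follows a standard form, sketched here for a single (anchor, positive, negative) triple of feature vectors. The margin value and the Euclidean distance are assumptions, and the linear layer the abstract applies to the rasterized embeddings beforehand is not modeled.

```python
import math


def triplet_loss(anchor, positive, negative, margin=0.5):
    """Standard triplet margin loss with Euclidean distance.

    The loss is zero once the negative is at least `margin` farther from the
    anchor than the positive is. The margin value is an assumption.
    """
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)
```

Hard mining along object boundaries, as described in the abstract, would amount to choosing triples where this loss is large, i.e., where a same-instance feature is farther from the anchor than a different-instance feature.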

Abstract:
The increasing frequency of natural disasters poses severe threats to human lives and leads to substantial economic losses. While 3D semantic segmentation is crucial for post‑disaster assessment, existing deep learning models lack datasets specifically designed for post‑disaster environments. To address this gap, we constructed a specialized 3D dataset from aerial footage of areas affected by Hurricane Ian (2022), captured by unmanned aerial vehicles (UAVs), employing Structure‑from‑Motion (SfM) and Multi‑View Stereo (MVS) techniques to reconstruct 3D point clouds. We evaluated the state‑of‑the‑art (SOTA) 3D semantic segmentation models Fast Point Transformer (FPT), Point Transformer v3 (PTv3), and OA‑CNNs on this dataset, exposing significant limitations in existing methods for disaster‑stricken regions. These findings underscore the urgent need for advancements in 3D segmentation techniques and the development of specialized 3D benchmark datasets to improve post‑disaster scene understanding and response.

Abstract:
Understanding urban perception from street view imagery has become a central topic in urban analytics and human‑centered urban design. However, most existing studies treat urban scenes as static and largely ignore the role of dynamic elements such as pedestrians and vehicles, raising concerns about potential bias in perception‑based urban analysis. To address this issue, we propose a controlled framework that isolates the perceptual effects of dynamic elements by constructing paired street view images with and without pedestrians and vehicles using semantic segmentation and MLLM‑guided generative inpainting. Based on 720 paired images from Dongguan, China, a perception experiment was conducted in which participants evaluated original and edited scenes across six perceptual dimensions. The results indicate that removing dynamic elements leads to a consistent 30.97% decrease in perceived vibrancy, whereas changes in other dimensions are more moderate and heterogeneous. To further explore the underlying mechanisms, we trained 11 machine learning models using multimodal visual features and identified that lighting conditions, human presence, and depth variation were key factors driving perceptual change. At the individual level, 65% of participants exhibited significant vibrancy changes, compared with 35‑50% for other dimensions; gender further showed a marginal moderating effect on safety perception. Beyond controlled experiments, the trained model was extended to a city‑scale dataset to predict vibrancy changes after the removal of dynamic elements. The city‑level results reveal that such perceptual changes are widespread and spatially structured, affecting 73.7% of locations and 32.1% of images, suggesting that urban perception assessments based solely on static imagery may substantially underestimate urban liveliness.

Abstract:
3D meshes are a fundamental representation widely used in computer science and engineering. In robotics, they are particularly valuable because they capture objects in a form that aligns directly with how robots interact with the physical world, enabling core capabilities such as predicting stable grasps, detecting collisions, and simulating dynamics. Although automatic 3D mesh generation methods have shown promising progress in recent years, potentially offering a path toward real‑time robot perception, two critical challenges remain. First, generating high‑fidelity meshes is prohibitively slow for real‑time use, often requiring tens of seconds per object. Second, mesh generation by itself is insufficient. In robotics, a mesh must be contextually grounded, i.e., correctly segmented from the scene and registered with the proper scale and pose. Additionally, unless these contextual grounding steps remain efficient, they simply introduce new bottlenecks. In this work, we introduce an end‑to‑end system that addresses these challenges, producing a high‑quality, contextually grounded 3D mesh from a single RGB‑D image in under one second. Our pipeline integrates open‑vocabulary object segmentation, accelerated diffusion‑based mesh generation, and robust point cloud registration, each optimized for both speed and accuracy. We demonstrate its effectiveness in a real‑world manipulation task, showing that it enables meshes to be used as a practical, on‑demand representation for robotics perception and planning.

Abstract:
Egocentric Referring Video Object Segmentation (Ego‑RVOS) aims to segment the specific object actively involved in a human action, as described by a language query, within first‑person videos. This task is critical for understanding egocentric human behavior. However, achieving such segmentation robustly is challenging due to ambiguities inherent in egocentric videos and biases present in training data. Consequently, existing methods often struggle, learning spurious correlations from skewed object‑action pairings in datasets and fundamental visual confounding factors of the egocentric perspective, such as rapid motion and frequent occlusions. To address these limitations, we introduce Causal Ego‑REferring Segmentation (CERES), a plug‑in causal framework that adapts strong, pre‑trained RVOS backbones to the egocentric domain. CERES implements dual‑modal causal intervention: applying backdoor adjustment principles to counteract language representation biases learned from dataset statistics, and leveraging front‑door adjustment concepts to address visual confounding by intelligently integrating semantic visual features with geometric depth information guided by causal principles, creating representations more robust to egocentric distortions. Extensive experiments demonstrate that CERES achieves state‑of‑the‑art performance on Ego‑RVOS benchmarks, highlighting the potential of applying causal reasoning to build more reliable models for broader egocentric video understanding.

Abstract:
Semantic segmentation is a fundamental task in computer vision with wide‑ranging applications, including autonomous driving and robotics. While RGB‑based methods have achieved strong performance with CNNs and Transformers, their effectiveness degrades under fast motion, low‑light, or high dynamic range conditions due to limitations of frame cameras. Event cameras offer complementary advantages such as high temporal resolution and low latency, yet lack color and texture, making them insufficient on their own. To address this, recent research has explored multimodal fusion of RGB and event data; however, many existing approaches are computationally expensive and focus primarily on spatial fusion, neglecting the temporal dynamics inherent in event streams. In this work, we propose MambaSeg, a novel dual‑branch semantic segmentation framework that employs parallel Mamba encoders to efficiently model RGB images and event streams. To reduce cross‑modal ambiguity, we introduce the Dual‑Dimensional Interaction Module (DDIM), comprising a Cross‑Spatial Interaction Module (CSIM) and a Cross‑Temporal Interaction Module (CTIM), which jointly perform fine‑grained fusion along both spatial and temporal dimensions. This design improves cross‑modal alignment, reduces ambiguity, and leverages the complementary properties of each modality. Extensive experiments on the DDD17 and DSEC datasets demonstrate that MambaSeg achieves state‑of‑the‑art segmentation performance while significantly reducing computational cost, showcasing its promise for efficient, scalable, and robust multimodal perception.

Abstract:
Open‑vocabulary semantic segmentation (OVSS) is fundamentally hampered by the coarse, image‑level representations of CLIP, which lack precise pixel‑level details. Existing training‑free methods attempt to resolve this by either importing priors from costly external foundation models (e.g., SAM, DINO) or by applying static, hand‑crafted heuristics to CLIP's internal features. These approaches are either computationally expensive or sub‑optimal. We propose the Attention Refinement Module (ARM), a lightweight, learnable module that effectively unlocks and refines CLIP's internal potential. Unlike static‑fusion methods, ARM learns to adaptively fuse hierarchical features. It employs a semantically‑guided cross‑attention block, using robust deep features (K, V) to select and refine detail‑rich shallow features (Q), followed by a self‑attention block. The key innovation lies in a "train once, use anywhere" paradigm. Trained once on a general‑purpose dataset (e.g., COCO‑Stuff), ARM acts as a universal plug‑and‑play post‑processor for diverse training‑free frameworks. Extensive experiments show that ARM consistently boosts baseline performance on multiple benchmarks with negligible inference overhead, establishing an efficient and effective paradigm for training‑free OVSS.

Abstract:
Accurate segmentation of the tooth point cloud is of great significance for clinical diagnosis assistance and treatment planning. Existing methods mostly employ semantic segmentation, focusing on the semantic features that distinguish different types of teeth. However, due to the tightly packed structure of teeth, unclear boundaries, and the diversity of complex cases such as missing and malposed teeth, semantic segmentation often struggles to achieve satisfactory results in complex dental cases. To address these issues, this paper proposes BATISNet, a boundary‑aware instance network for tooth point cloud segmentation. The network consists of a feature extraction backbone and an instance segmentation module. It not only extracts the semantic features of different types of teeth but also learns the instance features of individual teeth, which helps achieve more robust and accurate tooth instance segmentation in complex clinical scenarios such as missing teeth and malposed teeth. Additionally, to further enhance the completeness and accuracy of tooth boundary segmentation, a boundary‑aware loss function is designed to specifically supervise the boundary segmentation between instances. It effectively mitigates tooth adhesion and boundary ambiguity issues. Extensive experimental results show that BATISNet outperforms existing methods in tooth integrity segmentation, providing more reliable and detailed data support for practical clinical applications.

Abstract:
Glacial Lake Outburst Floods (GLOFs) are one of the most devastating climate‑change‑induced hazards. Existing remote monitoring approaches often prioritise maximising spatial coverage to train generalist models or rely on optical imagery hampered by persistent cloud coverage. This paper presents an end‑to‑end, automated deep learning pipeline for the targeted monitoring of high‑risk Himalayan glacial lakes using time‑series Sentinel‑1 SAR. We introduce a "temporal‑first" training strategy, utilising a U‑Net with an EfficientNet‑B3 backbone trained on a curated dataset of a cohort of 4 lakes (Tsho Rolpa, Chamlang Tsho, Tilicho and Gokyo Lake). The model achieves an IoU of 0.9130, validating the efficacy of the "temporal‑first" strategy required for transitioning to Early Warning Systems. Beyond the model, we propose an operational engineering architecture: a Dockerised pipeline that automates data ingestion via the ASF Search API and exposes inference results via a RESTful endpoint. This system shifts the paradigm from static mapping to dynamic and automated early warning, providing a scalable architectural foundation for future development in Early Warning Systems.

Abstract:
Text‑guided object segmentation requires both cross‑modal reasoning and pixel grounding abilities. Most recent methods treat text‑guided segmentation as one‑shot grounding, where the model predicts pixel prompts in a single forward pass to drive an external segmentor, which limits verification, refocusing and refinement when initial localization is wrong. To address this limitation, we propose RSAgent, an agentic Multimodal Large Language Model (MLLM) which interleaves reasoning and action for segmentation via multi‑turn tool invocations. RSAgent queries a segmentation toolbox, observes visual feedback, and revises its spatial hypothesis using historical observations to re‑localize targets and iteratively refine masks. We further build a data pipeline to synthesize multi‑turn reasoning segmentation trajectories, and train RSAgent with a two‑stage framework: cold‑start supervised fine‑tuning followed by agentic reinforcement learning with fine‑grained, task‑specific rewards. Extensive experiments show that RSAgent achieves a zero‑shot performance of 66.5% gIoU on ReasonSeg test, improving over Seg‑Zero‑7B by 9%, and reaches 81.5% cIoU on RefCOCOg, demonstrating state‑of‑the‑art performance on both in‑domain and out‑of‑domain benchmarks.
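The two metrics reported above aggregate IoU differently. The sketch below assumes the definitions commonly used on these benchmarks: gIoU averages the per-image IoU, while cIoU divides the cumulative intersection by the cumulative union over the whole set. The function name, the input layout, and the choice to treat an empty union as a perfect match are assumptions.

```python
def giou_and_ciou(per_image_counts):
    """Compute (gIoU, cIoU) from per-image pixel counts.

    per_image_counts: non-empty list of (intersection_pixels, union_pixels).
    gIoU: mean of per-image IoUs. cIoU: summed intersection / summed union.
    """
    # Assumption: an empty union (no prediction, no target) counts as IoU 1.0.
    ious = [i / u if u > 0 else 1.0 for i, u in per_image_counts]
    giou = sum(ious) / len(ious)

    total_intersection = sum(i for i, _ in per_image_counts)
    total_union = sum(u for _, u in per_image_counts)
    ciou = total_intersection / total_union if total_union > 0 else 1.0
    return giou, ciou
```

The two can diverge noticeably: cIoU is dominated by images with large objects, whereas gIoU weights every image equally, so the same predictions can rank differently under each metric.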

Abstract:
Self‑supervised semantic segmentation methods often suffer from structural errors, including merging distinct objects or fragmenting coherent regions, because they rely primarily on low‑level appearance cues such as color and texture. These cues lack structural discriminability: they carry no information about the structural organization of a region, making it difficult to distinguish boundaries between similar‑looking objects or maintain coherence within internally varying regions. Recent approaches attempt to address this by incorporating depth priors, yet remain limited by not explicitly modeling structural complexity that persists even when appearance cues are ambiguous. To bridge this gap, we present MSSSeg, a framework that explicitly learns multi‑scale structural complexity from both semantic and depth domains, via three coupled components: (1) a Differentiable Box‑Counting (DBC) module that captures and aligns multi‑scale structural complexity features with semantic features; (2) a Learnable Structural Augmentation (StructAug) that corrupts pixel‑intensity patterns, forcing the network to rely on structural complexity features from DBC; and (3) a Persistent Homology Loss (PHLoss) that directly supervises the structural complexity of predicted segmentations. Extensive experiments demonstrate that MSSSeg achieves new state‑of‑the‑art performance on COCO‑Stuff‑27, Cityscapes, and Potsdam without excessive computational overhead, validating that explicit structural complexity learning is crucial for self‑supervised segmentation.
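As background for the box-counting idea named above, here is a minimal sketch of classical (non-differentiable) box counting on a binary mask. It is not the paper's Differentiable Box‑Counting module. The mask layout (a list of rows of 0/1 values) and the default box sizes are assumptions, and the mask must contain at least one foreground pixel.

```python
import math


def box_count(mask, size):
    """Count size x size boxes that contain at least one foreground pixel."""
    height, width = len(mask), len(mask[0])
    count = 0
    for top in range(0, height, size):
        for left in range(0, width, size):
            if any(
                mask[r][c]
                for r in range(top, min(top + size, height))
                for c in range(left, min(left + size, width))
            ):
                count += 1
    return count


def box_counting_dimension(mask, sizes=(1, 2, 4, 8)):
    """Estimate the box-counting dimension as the least-squares slope of
    log N(s) against log(1/s) over the given box sizes."""
    if not any(any(row) for row in mask):
        raise ValueError("mask must contain at least one foreground pixel")
    xs = [math.log(1.0 / s) for s in sizes]
    ys = [math.log(box_count(mask, s)) for s in sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator
```

A filled square yields a dimension of 2 and a straight line a dimension of 1; irregular region boundaries fall in between, which is what makes the quantity a descriptor of structural complexity rather than of color or texture.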

Abstract:
Three‑dimensional (3D) tooth instance segmentation remains challenging due to crowded arches, ambiguous tooth‑gingiva boundaries, missing teeth, and rare yet clinically important third molars. Native 3D methods relying on geometric cues often suffer from boundary leakage, center drift, and inconsistent tooth identities, especially for minority classes and complex anatomies. Meanwhile, 2D foundation models such as the Segment Anything Model (SAM) provide strong boundary‑aware semantics, but directly applying them in 3D is impractical in clinical workflows. To address these issues, we propose SOFTooth, a semantics‑enhanced, order‑aware 2D‑3D fusion framework that leverages frozen 2D semantics without explicit 2D mask supervision. First, a point‑wise residual gating module injects occlusal‑view SAM embeddings into 3D point features to refine tooth‑gingiva and inter‑tooth boundaries. Second, a center‑guided mask refinement regularizes consistency between instance masks and geometric centroids, reducing center drift. Furthermore, an order‑aware Hungarian matching strategy integrates anatomical tooth order and center distance into similarity‑based assignment, ensuring coherent labeling even under missing or crowded dentitions. On 3DTeethSeg'22, SOFTooth achieves state‑of‑the‑art overall accuracy and mean IoU, with clear gains on cases involving third molars, demonstrating that rich 2D semantics can be effectively transferred to 3D tooth instance segmentation without 2D fine‑tuning.

Abstract:
Visual Simultaneous Localization and Mapping (vSLAM) systems encounter substantial challenges in dynamic environments where moving objects compromise tracking accuracy and map consistency. This paper introduces PCR‑ORB (Point Cloud Refinement ORB), an enhanced ORB‑SLAM3 framework that integrates deep learning‑based point cloud refinement to mitigate dynamic object interference. Our approach employs YOLOv8 for semantic segmentation combined with CUDA‑accelerated processing to achieve real‑time performance. The system implements a multi‑stage filtering strategy encompassing ground plane estimation, sky region removal, edge filtering, and temporal consistency validation. Comprehensive evaluation on the KITTI dataset (sequences 00‑09) demonstrates performance characteristics across different environmental conditions and scene types. Notable improvements are observed in specific sequences, with sequence 04 achieving 25.9% improvement in ATE RMSE and 30.4% improvement in ATE median. However, results show mixed performance across sequences, indicating scenario‑dependent effectiveness. The implementation provides insights into dynamic object filtering challenges and opportunities for robust navigation in complex environments.
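The ATE figures quoted above summarize per-pose translational errors. The sketch below computes the RMSE and median of those errors, assuming the estimated and ground-truth trajectories are already time-associated and aligned into a common frame; that alignment step is omitted, and `ate_stats` is a hypothetical helper.

```python
import math
import statistics


def ate_stats(estimated, ground_truth):
    """Return (RMSE, median) of per-pose translational errors.

    estimated, ground_truth: equally long, non-empty sequences of 3D
    positions, assumed to be already associated and aligned.
    """
    if len(estimated) != len(ground_truth):
        raise ValueError("trajectories must have the same length")
    errors = [math.dist(e, g) for e, g in zip(estimated, ground_truth)]
    rmse = math.sqrt(sum(err * err for err in errors) / len(errors))
    return rmse, statistics.median(errors)
```

RMSE is sensitive to a few large deviations while the median is not, so reporting both, as the abstract does for sequence 04, indicates whether an improvement comes from removing outliers or from a general shift in error.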

Abstract:
Image representation is a fundamental task in computer vision. Recently, Gaussian Splatting has emerged as an efficient representation framework, and its extension to 2D image representation enables lightweight, yet expressive modeling of visual content. While recent 2D Gaussian Splatting (2DGS) approaches provide compact storage and real‑time decoding, they often produce blurry or indistinct boundaries when the number of Gaussians is small due to the lack of contour awareness. In this work, we propose a Contour Information‑Aware 2D Gaussian Splatting framework that incorporates object segmentation priors into Gaussian‑based image representation. By constraining each Gaussian to a specific segmentation region during rasterization, our method prevents cross‑boundary blending and preserves edge structures under high compression. We also introduce a warm‑up scheme to stabilize training and improve convergence. Experiments on synthetic color charts and the DAVIS dataset demonstrate that our approach achieves higher reconstruction quality around object edges compared to existing 2DGS methods. The improvement is particularly evident in scenarios with very few Gaussians, while our method still maintains fast rendering and low memory usage.

Abstract:
Understanding road scenes for visual perception remains crucial for intelligent self‑driving cars. In particular, it is desirable to detect unexpected small road hazards reliably in real‑time, especially under varying adverse conditions (e.g., weather and daylight). However, existing road driving datasets provide large‑scale images acquired in either normal or adverse scenarios only, and often do not contain the road obstacles captured in the same visual domain as for the other classes. To address this, we introduce a new dataset called AVOID, the Adverse Visual Conditions Dataset, for real‑time obstacle detection collected in a simulated environment. AVOID consists of a large set of unexpected road obstacles located along each path captured under various weather and time conditions. Each image is coupled with the corresponding semantic and depth maps, raw and semantic LiDAR data, and waypoints, thereby supporting most visual perception tasks. We benchmark the results on high‑performing real‑time networks for the obstacle detection task, and also propose and conduct ablation studies using a comprehensive multi‑task network for semantic segmentation, depth and waypoint prediction tasks.

Abstract:
Understanding semantics and dynamics has been crucial for embodied agents in various tasks. Both tasks have much more data redundancy than the static scene understanding task. We formulate the view selection problem as an active learning problem, where the goal is to prioritize frames that provide the greatest information gain for model training. To this end, we propose an active learning algorithm with Fisher Information that quantifies the informativeness of candidate views with respect to both semantic Gaussian parameters and deformation networks. This formulation allows our method to jointly handle semantic reasoning and dynamic scene modeling, providing a principled alternative to heuristic or random strategies. We evaluate our method on large‑scale static images and dynamic video datasets by selecting informative frames from multi‑camera setups. Experimental results demonstrate that our approach consistently improves rendering quality and semantic segmentation performance, outperforming baseline methods based on random selection and uncertainty‑based heuristics.

Abstract:
This paper addresses the problem of decomposed 4D scene reconstruction from multi‑view videos. Recent methods achieve this by lifting video segmentation results to a 4D representation through differentiable rendering techniques. Therefore, they heavily rely on the quality of video segmentation maps, which are often unstable, leading to unreliable reconstruction results. To overcome this challenge, our key idea is to represent the decomposed 4D scene with the Freetime FeatureGS and design a streaming feature learning strategy to accurately recover it from per‑image segmentation maps, eliminating the need for video segmentation. Freetime FeatureGS models the dynamic scene as a set of Gaussian primitives with learnable features and linear motion ability, allowing them to move to neighboring regions over time. We apply a contrastive loss to Freetime FeatureGS, forcing primitive features to be close or far apart based on whether their projections belong to the same instance in the 2D segmentation map. As our Gaussian primitives can move across time, it naturally extends the feature learning to the temporal dimension, achieving 4D segmentation. Furthermore, we sample observations for training in a temporally ordered manner, enabling the streaming propagation of features over time and effectively avoiding local minima during the optimization process. Experimental results on several datasets show that the reconstruction quality of our method outperforms recent methods by a large margin.
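
The contrastive objective described above can be illustrated with a minimal sketch. The margin-based pairwise form, the function name `pairwise_contrastive_loss`, and the toy features are assumptions made for illustration; the abstract does not specify the exact loss used for Freetime FeatureGS.

```python
import math

def pairwise_contrastive_loss(features, instance_ids, margin=1.0):
    """Mean contrastive loss over all feature pairs (illustrative form only).

    Same-instance pairs are penalised by squared distance (pulled together);
    different-instance pairs are penalised only while closer than `margin`.
    """
    total, count = 0.0, 0
    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            dist = math.dist(features[i], features[j])
            if instance_ids[i] == instance_ids[j]:
                total += dist ** 2
            else:
                total += max(0.0, margin - dist) ** 2
            count += 1
    return total / count if count else 0.0

# Hypothetical rendered features at four pixels and their 2D instance labels.
features = [(0.0, 0.0), (0.1, 0.0), (2.0, 0.0), (0.5, 0.0)]
instance_ids = [1, 1, 2, 2]
loss = pairwise_contrastive_loss(features, instance_ids)

# Features that already separate the two instances incur no loss.
separated = [(0.0, 0.0), (0.0, 0.0), (5.0, 0.0), (5.0, 0.0)]
separated_loss = pairwise_contrastive_loss(separated, instance_ids)
```

In this toy case the loss is driven mostly by the fourth feature, which belongs to instance 2 but sits closer to the features of instance 1.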

Abstract:
Accurate, up‑to‑date sidewalk data is essential for building accessible and inclusive pedestrian infrastructure, yet current approaches to data collection are often costly, fragmented, and difficult to scale. We introduce iOSPointMapper, a mobile application that enables real‑time, privacy‑conscious sidewalk mapping on the ground, using recent‑generation iPhones and iPads. The system leverages on‑device semantic segmentation, LiDAR‑based depth estimation, and fused GPS/IMU data to detect and localize sidewalk‑relevant features such as traffic signs, traffic lights and poles. To ensure transparency and improve data quality, iOSPointMapper incorporates a user‑guided annotation interface for validating system outputs before submission. Collected data is anonymized and transmitted to the Transportation Data Exchange Initiative (TDEI), where it integrates seamlessly with broader multimodal transportation datasets. Detailed evaluations of the system's feature detection and spatial mapping performance reveal the application's potential for enhanced pedestrian mapping. Together, these capabilities offer a scalable and user‑centered approach to closing critical data gaps in pedestrian

Abstract:
Prompt‑driven Video Segmentation Foundation Models (VSFMs), such as SAM2, are increasingly used in applications including autonomous driving and digital pathology, yet their security risks remain underexplored. We study backdoor attacks against VSFMs and show that directly applying classic attacks such as BadNet is largely ineffective, yielding attack success rates (ASR) below 5%. Through gradient‑similarity and attention‑map analyses, we find that traditional backdoor training fails because clean and triggered samples induce aligned image‑encoder gradients, while model attention remains focused on the prompt‑specified object rather than the trigger. To address this limitation, we propose BadVSFM, the first backdoor attack framework tailored to prompt‑driven VSFMs. BadVSFM uses a two‑stage strategy that first learns trigger‑specific encoder features and then trains the decoder to map triggered frame prompt representations to an attacker‑specified target mask while preserving clean segmentation behavior. Experiments on five VSFMs and two datasets show that BadVSFM achieves strong, controllable backdoor effects across triggers and prompt types with limited clean‑performance degradation. Ablations and interpretability analyses validate the necessity of the two‑stage design, and five representative defenses remain largely ineffective. Our results reveal a practical and underexplored vulnerability of current VSFMs to backdoor threats.

Abstract:
Segment Anything Model 2 (SAM2), a vision foundation model, has significantly advanced prompt‑driven video object segmentation, yet its practical deployment remains limited by the high computational and memory cost of processing dense visual tokens across time. SAM2 pipelines typically propagate all visual tokens produced by the image encoder through downstream temporal reasoning modules, regardless of their relevance to the target object, resulting in reduced scalability due to quadratic memory attention overhead. In this work, we introduce a text‑guided token pruning framework that improves inference efficiency by selectively reducing token density prior to temporal propagation, without modifying the underlying segmentation architecture. Operating after visual encoding and before memory‑based propagation, our method ranks tokens using a lightweight routing mechanism that integrates local visual context, semantic relevance derived from object‑centric textual descriptions (either user‑provided or automatically generated), and uncertainty cues that help preserve ambiguous or boundary‑critical regions. By retaining only the most informative tokens for downstream processing, the proposed approach reduces redundant computation while maintaining segmentation fidelity. Extensive experiments across multiple challenging video segmentation benchmarks demonstrate that post‑encoder token pruning provides a practical and effective pathway to efficient, prompt‑aware video segmentation, achieving up to 42.50 percent faster inference and 37.41 percent lower GPU memory usage compared to the unpruned baseline SAM2, while preserving competitive J and F performance. These results highlight the potential of early token selection to improve the scalability of transformer‑based video segmentation systems for real‑time and resource‑constrained applications.
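
A minimal sketch of score-based token pruning, in the spirit of the routing step described above. The linear combination of the three cues, the `weights` parameter, and the function name `prune_tokens` are assumptions for illustration; the abstract does not state how the cues are combined.

```python
def prune_tokens(visual, semantic, uncertainty, keep_ratio=0.5,
                 weights=(1.0, 1.0, 1.0)):
    """Return the sorted indices of tokens kept after score-based pruning.

    Each token gets a weighted sum of three per-token cues (hypothetical
    combination rule); the top `keep_ratio` fraction is retained.
    """
    w_v, w_s, w_u = weights
    scores = [w_v * v + w_s * s + w_u * u
              for v, s, u in zip(visual, semantic, uncertainty)]
    n_keep = max(1, int(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore the original token order so positional structure is preserved.
    return sorted(ranked[:n_keep])

# Toy per-token cues for four tokens (made-up values).
visual = [0.2, 0.9, 0.1, 0.4]
semantic = [0.1, 0.8, 0.0, 0.3]
uncertainty = [0.0, 0.1, 0.9, 0.2]
kept = prune_tokens(visual, semantic, uncertainty, keep_ratio=0.5)
```

Token 2 has low visual and semantic scores but survives because of its high uncertainty, which mirrors the stated goal of preserving ambiguous or boundary-critical regions.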

Abstract:
Image fusion integrates complementary information from different modalities to generate high‑quality fused images, thereby enhancing downstream tasks such as object detection and semantic segmentation. Unlike task‑specific techniques that primarily focus on consolidating inter‑modal information, general image fusion needs to address a wide range of tasks while improving performance without increasing complexity. To achieve this, we propose SMC‑Mamba, a Self‑supervised Multiplex Consensus Mamba framework for general image fusion. Specifically, the Modality‑Agnostic Feature Enhancement (MAFE) module preserves fine details through adaptive gating and enhances global representations via spatial‑channel and frequency‑rotational scanning. The Multiplex Consensus Cross‑modal Mamba (MCCM) module enables dynamic collaboration among experts, reaching a consensus to efficiently integrate complementary information from multiple modalities. The cross‑modal scanning within MCCM further strengthens feature interactions across modalities, facilitating seamless integration of critical information from both sources. Additionally, we introduce a Bi‑level Self‑supervised Contrastive Learning Loss (BSCL), which preserves high‑frequency information without increasing computational overhead while simultaneously boosting performance in downstream tasks. Extensive experiments demonstrate that our approach outperforms state‑of‑the‑art (SOTA) image fusion algorithms in tasks such as infrared‑visible, medical, multi‑focus, and multi‑exposure fusion, as well as downstream visual tasks.

Abstract:
Traditional autonomous driving pipelines decouple camera design from downstream perception, relying on fixed optics and handcrafted ISPs that prioritize human viewable imagery rather than machine semantics. This separation discards information during demosaicing, denoising, or quantization, while forcing models to adapt to sensor artifacts. We present a task‑driven co‑design framework that unifies optics, sensor modeling, and lightweight semantic segmentation networks into a single end‑to‑end RAW‑to‑task pipeline. Building on DeepLens[19], our system integrates realistic cellphone‑scale lens models, learnable color filter arrays, Poisson‑Gaussian noise processes, and quantization, all optimized directly for segmentation objectives. Evaluations on KITTI‑360 show consistent mIoU improvements over fixed pipelines, with optics modeling and CFA learning providing the largest gains, especially for thin or low‑light‑sensitive classes. Importantly, these robustness gains are achieved with a compact ~1M‑parameter model running at ~28 FPS, demonstrating edge deployability. Visual and quantitative analyses further highlight how co‑designed sensors adapt acquisition to semantic structure, sharpening boundaries and maintaining accuracy under blur, noise, and low bit‑depth. Together, these findings establish full‑stack co‑optimization of optics, sensors, and networks as a principled path toward efficient, reliable, and deployable perception in autonomous systems.

Abstract:
Camera‑based 3D Semantic Scene Completion (SSC) is a critical task for autonomous driving and robotic scene understanding. It aims to infer a complete 3D volumetric representation of both semantics and geometry from a single image. Existing methods typically focus on end‑to‑end 2D‑to‑3D feature lifting and voxel completion. However, they often overlook the interference between high‑confidence visible‑region perception and low‑confidence occluded‑region reasoning caused by single‑image input, which can lead to feature dilution and error propagation. To address these challenges, we introduce an offline Visible Region Label Extraction (VRLE) strategy that explicitly separates and extracts voxel‑level supervision for visible regions from dense 3D ground truth. This strategy purifies the supervisory space for two complementary sub‑tasks: visible‑region perception and occluded‑region reasoning. Building on this idea, we propose the Visible‑Occluded Interactive Completion Network (VOIC), a novel dual‑decoder framework that explicitly decouples SSC into visible‑region semantic perception and occluded‑region scene completion. VOIC first constructs a base 3D voxel representation by fusing image features with depth‑derived occupancy. The visible decoder focuses on generating high‑fidelity geometric and semantic priors, while the occlusion decoder leverages these priors together with cross‑modal interaction to perform coherent global scene reasoning. Extensive experiments on the SemanticKITTI and SSCBench‑KITTI360 benchmarks demonstrate that VOIC outperforms existing monocular SSC methods in both geometric completion and semantic segmentation accuracy, achieving state‑of‑the‑art performance.

Abstract:
Understanding spatial openness is vital for improving residential quality and design; however, studies often treat its influencing factors separately. This study developed a quantitative framework to evaluate the spatial openness in housing from two‑ (2D) and three‑ (3D) dimensional perspectives. Using data from 4,004 rental units in Tokyo's 23 wards, we examined the temporal and spatial variations in openness and its relationship with rent and housing attributes. 2D openness was computed via planar visibility using visibility graph analysis (VGA) from floor plans, whereas 3D openness was derived from interior images analysed using Mask2Former, a semantic segmentation model that identifies walls, ceilings, floors, and windows. The results showed an increase in living room visibility and a 1990s peak in overall openness. Spatial analyses revealed partial correlations among openness, rent, and building characteristics, reflecting urban redevelopment trends. Although the 2D and 3D openness indicators were not directly correlated, higher openness tended to correspond to higher rent. The impression scores predicted by the existing models were only weakly related to openness, suggesting that the interior design and furniture more strongly shape perceived space. This study offers a new multidimensional data‑driven framework for quantifying residential spatial openness and linking it with urban and market dynamics.

Abstract:
Semantic segmentation of outdoor street scenes plays a key role in applications such as autonomous driving, mobile robotics, and assistive technology for visually‑impaired pedestrians. For these applications, accurately distinguishing between key surfaces and objects such as roads, sidewalks, vehicles, and pedestrians is essential for maintaining safety and minimizing risks. Semantic segmentation must be robust to different environments, lighting and weather conditions, and sensor noise, while being performed in real‑time. We propose a region‑level, uncertainty‑gated retrieval mechanism that improves segmentation accuracy and calibration under domain shift. Our best method achieves an 11.3% increase in mean intersection‑over‑union while reducing retrieval cost by 87.5%, retrieving for only 12.5% of regions compared to 100% for an always‑on baseline.
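
The gating idea and the cost arithmetic can be sketched as follows. Using Shannon entropy as the uncertainty measure, the threshold value, and the toy probabilities are all assumptions; the abstract does not say which uncertainty estimate triggers retrieval.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one region's class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def regions_to_retrieve(region_probs, threshold):
    """Indices of regions uncertain enough to trigger retrieval."""
    return [i for i, probs in enumerate(region_probs)
            if entropy(probs) > threshold]

# Seven confident regions and one ambiguous region (made-up values).
region_probs = [[0.97, 0.02, 0.01]] * 7 + [[0.40, 0.35, 0.25]]
selected = regions_to_retrieve(region_probs, threshold=0.5)

retrieval_rate = len(selected) / len(region_probs)
cost_reduction = 1.0 - retrieval_rate  # relative to retrieving for every region
```

With one of eight regions gated in, the retrieval rate is 12.5% and the cost reduction relative to an always-on baseline is 87.5%, the same relationship as the figures reported above.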

Abstract:
The advancement of safety‑critical research in driving behavior in ADAS‑equipped vehicles requires real‑world datasets that not only include diverse traffic scenarios but also capture high‑risk edge cases such as near‑miss events and system failures. However, existing datasets are largely limited to either simulated environments or human‑driven vehicle data, lacking authentic ADAS (Advanced Driver Assistance System) vehicle behavior under risk conditions. To address this gap, this paper introduces SAVeD, a large‑scale video dataset curated from publicly available social media content, explicitly focused on ADAS vehicle‑related crashes, near‑miss incidents, and disengagements. SAVeD features 2,119 first‑person videos, capturing ADAS vehicle operations in diverse locations, lighting conditions, and weather scenarios. The dataset includes video frame‑level annotations for collisions, evasive maneuvers, and disengagements, enabling analysis of both perception and decision‑making failures. We demonstrate SAVeD's utility through multiple analyses and contributions: (1) We propose a novel framework integrating semantic segmentation and monocular depth estimation to compute real‑time Time‑to‑Collision (TTC) for dynamic objects. (2) We utilize the Generalized Extreme Value (GEV) distribution to model and quantify the extreme risk in crash and near‑miss events across different roadway types. (3) We establish benchmarks for state‑of‑the‑art VLLMs (VideoLLaMA2 and InternVL2.5 HiCo R16), showing that SAVeD's detailed annotations significantly enhance model performance through domain adaptation in complex near‑miss scenarios.
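
A minimal sketch of how a segmentation mask and a depth map could be combined into a TTC estimate. The median-depth pooling, the finite-difference closing speed, and both function names are assumptions for illustration and are not taken from the SAVeD framework.

```python
import math
from statistics import median

def object_depth(depth_map, mask):
    """Median depth (metres) over the pixels a segmentation mask marks as the object."""
    return median(d for depth_row, mask_row in zip(depth_map, mask)
                  for d, m in zip(depth_row, mask_row) if m)

def time_to_collision(depth_prev_m, depth_curr_m, dt_s):
    """TTC in seconds from two depth estimates of one object taken dt_s apart."""
    closing_speed = (depth_prev_m - depth_curr_m) / dt_s  # m/s, positive when approaching
    if closing_speed <= 0.0:
        return math.inf  # the object is not getting closer
    return depth_curr_m / closing_speed

# Toy 2x3 depth maps for two consecutive frames and one object mask.
mask = [[0, 1, 1],
        [0, 1, 0]]
depth_prev = [[50.0, 21.0, 21.5],
              [50.0, 20.5, 50.0]]
depth_curr = [[50.0, 20.0, 20.5],
              [50.0, 19.5, 50.0]]

d_prev = object_depth(depth_prev, mask)
d_curr = object_depth(depth_curr, mask)
ttc = time_to_collision(d_prev, d_curr, dt_s=0.1)
```

Here the object closes from 21 m to 20 m in 0.1 s, so the estimated TTC is about 2 s. A real pipeline would also need to handle the scale ambiguity and noise of monocular depth.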

Abstract:
Unmanned surface vehicles can encounter a number of varied visual circumstances during operation, some of which can be very difficult to interpret. While most cases can be solved only using color camera images, some weather and lighting conditions require additional information. To expand the available maritime data, we present a novel multimodal maritime dataset MULTIAQUA (Multimodal Aquatic Dataset). Our dataset contains synchronized, calibrated and annotated data captured by sensors of different modalities, such as RGB, thermal, IR, LIDAR, etc. The dataset is aimed at developing supervised methods that can extract useful information from these modalities in order to provide a high quality of scene interpretation regardless of potentially poor visibility conditions. To illustrate the benefits of the proposed dataset, we evaluate several multimodal methods on our difficult nighttime test set. We present training approaches that enable multimodal methods to be trained in a more robust way, thus enabling them to retain reliable performance even in near‑complete darkness. Our approach allows for training a robust deep neural network only using daytime images, thus significantly simplifying data acquisition, annotation, and the training process.

Abstract:
Accurate flood detection from visual data is a critical step toward improving disaster response and risk assessment, yet datasets for flood segmentation remain scarce due to the challenges of collecting and annotating large‑scale imagery. Existing resources are often limited in geographic scope and annotation detail, hindering the development of robust, generalized computer vision methods. To bridge this gap, we introduce AIFloodSense, a comprehensive, publicly available aerial imagery dataset comprising 470 high‑resolution images from 230 distinct flood events across 64 countries and six continents. Unlike prior benchmarks, AIFloodSense ensures global diversity and temporal relevance (2022‑2024), supporting three complementary tasks: (i) Image Classification with novel sub‑tasks for environment type, camera angle, and continent recognition; (ii) Semantic Segmentation providing precise pixel‑level masks for flood, sky, and buildings; and (iii) Visual Question Answering (VQA) to enable natural language reasoning for disaster assessment. We establish baseline benchmarks for all tasks using state‑of‑the‑art architectures, demonstrating the dataset's complexity and its value in advancing domain‑generalized AI tools for climate resilience.

Abstract:
Inspired by the success of generative pretraining in natural language, we ask whether the same principles can yield strong self‑supervised visual learners. Instead of training models to output features for downstream use, we train them to generate embeddings to perform predictive tasks directly. This work explores such a shift from learning representations to learning models. Specifically, models learn to predict future patch embeddings conditioned on past ones, using causal masking and stop gradient, which we refer to as Next‑Embedding Predictive Autoregression (NEPA). We demonstrate that a simple Transformer pretrained on ImageNet‑1K with next embedding prediction as its sole learning objective is effective, with no pixel reconstruction, discrete tokens, contrastive loss, or task‑specific heads. This formulation retains architectural simplicity and scalability, without requiring additional design complexity. NEPA achieves strong results across tasks, attaining 83.8% and 85.3% top‑1 accuracy on ImageNet‑1K with ViT‑B and ViT‑L backbones after fine‑tuning, and transferring effectively to semantic segmentation on ADE20K. We believe generative pretraining from embeddings provides a simple, scalable, and potentially modality‑agnostic alternative to visual self‑supervised learning.

Abstract:
Accurate surgical instrument segmentation in endoscopy is crucial for computer‑assisted interventions, yet remains challenging due to frequent occlusions, rapid motion, and long‑term instrument re‑entry. While SAM3 provides a powerful spatio‑temporal framework for video object segmentation, its performance in surgical scenes is limited by indiscriminate memory updates, fixed memory capacity, and weak identity recovery after occlusions. We propose ReMeDI‑SAM3, a training‑free extension of SAM3, that addresses these limitations through three components: (i) relevance‑aware memory filtering with a dedicated occlusion‑aware memory for storing pre‑occlusion frames, (ii) a piecewise interpolation scheme that expands effective memory capacity, and (iii) a feature‑based re‑identification module with temporal voting for reliable post‑occlusion identity disambiguation. Together, these components mitigate error accumulation and enable reliable recovery after occlusions. Evaluations on EndoVis17, EndoVis18 and CholecSeg8k under a zero‑shot setting show mcIoU improvements of around 5.8%, 8%, and 2% respectively, over vanilla SAM3, outperforming even prior training‑based approaches.

Abstract:
With the rapid progress of controllable generation, training data synthesis has become a promising way to expand labeled datasets and alleviate manual annotation in remote sensing (RS). However, the complexity of semantic mask control and the uncertainty of sampling quality often limit the utility of synthetic data in downstream semantic segmentation tasks. To address these challenges, we propose a task‑oriented data synthesis framework (TODSynth), including a Multimodal Diffusion Transformer (MM‑DiT) with unified triple attention and a plug‑and‑play sampling strategy guided by task feedback. Built upon the powerful DiT‑based generative foundation model, we systematically evaluate different control schemes, showing that a text‑image‑mask joint attention scheme combined with full fine‑tuning of the image and mask branches significantly enhances the effectiveness of RS semantic segmentation data synthesis, particularly in few‑shot and complex‑scene scenarios. Furthermore, we propose a control‑rectify flow matching (CRFM) method, which dynamically adjusts sampling directions guided by semantic loss during the early high‑plasticity stage, mitigating the instability of generated images and bridging the gap between synthetic data and downstream segmentation tasks. Extensive experiments demonstrate that our approach consistently outperforms state‑of‑the‑art controllable generation methods, producing more stable and task‑oriented synthetic data for RS semantic segmentation.

Abstract:
Fine‑tuning Vision Foundation Models (VFMs) with a small number of parameters has shown remarkable performance in Domain Generalized Semantic Segmentation (DGSS). Most existing works either train lightweight adapters or refine intermediate features to achieve better generalization on unseen domains. However, they both overlook the fact that long‑term pre‑trained VFMs often exhibit artifacts, which hinder the utilization of valuable representations and ultimately degrade DGSS performance. Inspired by causal mechanisms, we observe that these artifacts are associated with non‑causal factors, which usually reside in the low‑ and high‑frequency components of the VFM spectrum. In this paper, we explicitly examine the causal and non‑causal factors of features within VFMs for DGSS, and propose a simple yet effective method to identify and disentangle them, enabling more robust domain generalization. Specifically, we propose Causal‑Tune, a novel fine‑tuning strategy designed to extract causal factors and suppress non‑causal ones from the features of VFMs. First, we extract the frequency spectrum of features from each layer using the Discrete Cosine Transform (DCT). A Gaussian band‑pass filter is then applied to separate the spectrum into causal and non‑causal components. To further refine the causal components, we introduce a set of causal‑aware learnable tokens that operate in the frequency domain, while the non‑causal components are discarded. Finally, refined features are transformed back into the spatial domain via inverse DCT and passed to the next layer. Extensive experiments conducted on various cross‑domain tasks demonstrate the effectiveness of Causal‑Tune. In particular, our method achieves superior performance under adverse weather conditions, improving +4.8% mIoU over the baseline in snow conditions.
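
The DCT, band-pass, inverse-DCT sequence described above can be sketched on a 1D signal. This is a simplified illustration: the filter `center` and `sigma` values are arbitrary assumptions, the learnable frequency-domain tokens are omitted, and real features would be multi-dimensional tensors rather than a short list.

```python
import math

def dct(x):
    """Orthonormal DCT-II of a 1D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def idct(c):
    """Inverse of `dct` (orthonormal DCT-III)."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(c[k] * math.sqrt(2.0 / n)
                 * math.cos(math.pi * (i + 0.5) * k / n) for k in range(1, n))
        out.append(s)
    return out

def gaussian_band_pass(n, center, sigma):
    """Soft mask that peaks at frequency index `center` and decays on both sides."""
    return [math.exp(-((k - center) ** 2) / (2.0 * sigma ** 2)) for k in range(n)]

def band_pass_features(x, center, sigma):
    """Keep mid-band frequencies of x; attenuate the low and high ends."""
    coeffs = dct(x)
    mask = gaussian_band_pass(len(x), center, sigma)
    return idct([c * m for c, m in zip(coeffs, mask)])

signal = [1.0, 3.0, 2.0, 5.0]
round_trip = idct(dct(signal))                      # should reproduce the signal
flat = band_pass_features([1.0] * 4, center=2, sigma=0.5)  # pure DC input
```

A constant input contains only the zero-frequency component, so the band-pass output is close to zero, which is the behaviour expected when the lowest frequencies are treated as non-causal.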

Abstract:
Omni‑modal models that have multimodal input and output are emerging. However, benchmarking their multimodal generation, especially in image generation, is challenging due to the subtleties of human preferences and model biases. Many image generation benchmarks focus on aesthetics instead of the fine‑grained generation capabilities of these models, failing to evaluate their visual intelligence with objective metrics. In PixelArena, we propose using semantic segmentation tasks to objectively examine their fine‑grained generative intelligence with pixel precision. With our benchmark and experiments, we find the latest Gemini 3 Pro Image has emergent image generation capabilities that generate semantic masks with high fidelity under zero‑shot settings, showcasing visual intelligence unseen before and true generalization in new image generation tasks. We further investigate its results, compare them qualitatively and quantitatively with those of other models, and present failure cases. The findings not only signal exciting progress in the field but also provide insights into future research related to dataset development, omni‑modal model development, and the design of metrics.

Abstract:
We propose a novel dataset that has been specifically designed for 3D semantic segmentation of bridges and the domain gap analysis caused by varying sensors. This addresses a critical need in the field of infrastructure inspection and maintenance, which is essential for modern society. The dataset comprises high‑resolution 3D scans of a diverse range of bridge structures from various countries, with detailed semantic labels provided for each. Our initial objective is to facilitate accurate and automated segmentation of bridge components, thereby advancing the structural health monitoring practice. To evaluate the effectiveness of existing 3D deep learning models on this novel dataset, we conduct a comprehensive analysis of three distinct state‑of‑the‑art architectures. Furthermore, we present data acquired through diverse sensors to quantify the domain gap resulting from sensor variations. Our findings indicate that all architectures demonstrate robust performance on the specified task. However, the domain gap can potentially lead to a decline in the performance of up to 11.4% mIoU.

Abstract:
The capabilities and number of vision‑based models are increasing rapidly. These vision models can now perform more tasks, such as object detection, image classification, and instance segmentation, with great accuracy. But models that can take accurate quantitative measurements from an image, as a human can do by just looking at it, are rare. For a robot to work with complete autonomy in a laboratory environment, it needs basic skills such as navigation, handling objects, and preparing samples to match human‑like capabilities in an unstructured environment. Another important capability is to read measurements from instruments and apparatus. Here, we mimic a human‑inspired approach to reading measurements from a linear scale. As a test case, we read the level from a syringe and a measuring cylinder. For a randomly oriented syringe, we carry out transformations to correct the orientation. To make the system efficient and robust, the area of interest is reduced to just the part of the image containing the linear scale. After that, a series of features is extracted, such as the major markers, the corresponding digits, and the level indicator location, from which the final reading is calculated. Readings obtained using this system were also compared against human‑read values of the same instances, and an accurate correspondence was observed.

Abstract:
Weakly Supervised Semantic Segmentation (WSSS) with image‑level labels aims to produce pixel‑level predictions without requiring dense annotations. While recent approaches have leveraged generative models to augment existing data, they remain dependent on real‑world training samples. In this paper, we introduce a novel direction, Zero Shot Weakly Supervised Semantic Segmentation (ZSWSSS), and propose SynthSeg Agents, a multi‑agent framework driven by Large Language Models (LLMs) to generate synthetic training data entirely without real images. SynthSeg Agents comprises two key modules, a Self Refine Prompt Agent and an Image Generation Agent. The Self Refine Prompt Agent autonomously crafts diverse and semantically rich image prompts via iterative refinement, memory mechanisms, and prompt space exploration, guided by CLIP‑based similarity and nearest‑neighbor diversity filtering. These prompts are then passed to the Image Generation Agent, which leverages Vision Language Models (VLMs) to synthesize candidate images. A frozen CLIP scoring model is employed to select high‑quality samples, and a ViT‑based classifier is further trained to relabel the entire synthetic dataset with improved semantic precision. Our framework produces high‑quality training data without any real image supervision. Experiments on PASCAL VOC 2012 and COCO 2014 show that SynthSeg Agents achieves competitive performance without using real training images. This highlights the potential of LLM‑driven agents in enabling cost‑efficient and scalable semantic segmentation.

Abstract:
We address the fundamental incompatibility of attention‑based encoder‑decoder (AED) models with long‑form acoustic encodings. AED models trained on segmented utterances learn to encode absolute frame positions by exploiting limited acoustic context beyond segment boundaries, but fail to generalize when decoding long‑form segments where these cues vanish. The model loses the ability to order acoustic encodings due to the permutation invariance of keys and values in cross‑attention. We propose four modifications: (1) injecting explicit absolute positional encodings into cross‑attention for each decoded segment, (2) long‑form training with extended acoustic context to eliminate implicit absolute position encoding, (3) segment concatenation to cover diverse segmentations needed during training, and (4) semantic segmentation to align AED‑decoded segments with training segments. We show these modifications close the accuracy gap between continuous and segmented acoustic encodings, enabling auto‑regressive use of the attention decoder.

Abstract:
With their high information density and intuitive readability, charts have become the de facto medium for data analysis and communication across disciplines. Recent multimodal large language models (MLLMs) have made notable progress in automated chart understanding, yet they remain heavily dependent on explicit textual annotations, and their performance degrades markedly when key numerals are absent. To address this limitation, we introduce ChartAgent, a chart understanding framework grounded in Tool‑Integrated Reasoning (TIR). Inspired by human cognition, ChartAgent decomposes complex chart analysis into a sequence of observable, replayable steps. Supporting this architecture is an extensible, modular tool library comprising more than a dozen core tools, such as key element detection, instance segmentation, and optical character recognition (OCR), which the agent dynamically orchestrates to achieve systematic visual parsing across diverse chart types. Leveraging TIR's transparency and verifiability, ChartAgent moves beyond the black‑box paradigm by standardizing and consolidating intermediate outputs into a structured Evidence Package, providing traceable and reproducible support for final conclusions. Experiments show that ChartAgent substantially improves robustness under sparse annotation settings, offering a practical path toward trustworthy and extensible systems for chart understanding.

Abstract:
This paper provides a review of deep learning applications in scene understanding in autonomous robots, including innovations in object detection, semantic and instance segmentation, depth estimation, 3D reconstruction, and visual SLAM. It emphasizes how these techniques address limitations of traditional geometric models, improve depth perception in real time despite occlusions and textureless surfaces, and enhance semantic reasoning to understand the environment better. When these perception modules are integrated into dynamic and unstructured environments, they become more effective in decision‑making, navigation, and interaction. Lastly, the review outlines the existing problems and research directions to advance learning‑based scene understanding of autonomous robots.

Abstract:
This paper presents a new dataset for Novel View Synthesis, generated from a high‑quality, animated film with stunning realism and intricate detail. Our dataset captures a variety of dynamic scenes, complete with detailed textures, lighting, and motion, making it ideal for training and evaluating cutting‑edge 4D scene reconstruction and novel view generation models. In addition to high‑fidelity RGB images, we provide multiple complementary modalities, including depth, surface normals, object segmentation and optical flow, enabling a deeper understanding of scene geometry and motion. The dataset is organised into three distinct benchmarking scenarios: a dense multi‑view camera setup, a sparse camera arrangement, and monocular video sequences, enabling a wide range of experimentation and comparison across varying levels of data sparsity. With its combination of visual richness, high‑quality annotations, and diverse experimental setups, this dataset offers a unique resource for pushing the boundaries of view synthesis and 3D vision.

Abstract:
Semantic segmentation requires a holistic understanding of the physical world, as it assigns semantic labels to spatially continuous and structurally coherent objects rather than to isolated pixels. However, existing data‑free knowledge distillation (DFKD) methods, primarily designed for classification, often disregard this continuity, resulting in significant performance degradation when applied directly to segmentation tasks. In this paper, we introduce DFSS, a novel data‑free distillation framework tailored for semantic segmentation. Unlike prior approaches that treat pixels independently, DFSS respects the structural and contextual continuity of real‑world scenes. Our key insight is to leverage Batch Normalization (BN) statistics from a teacher model to guide Approximate Distribution Sampling (ADS), enabling the selection of data that better reflects the original training distribution without relying on potentially misleading teacher predictions. Additionally, we propose Weighted Distribution Progressive Distillation (WDPD), which dynamically prioritizes reliable samples that are more closely aligned with the original data distribution early in training and gradually incorporates more challenging cases, mirroring the natural progression of learning in human perception. Extensive experiments on standard benchmarks demonstrate that DFSS consistently outperforms existing data‑free distillation methods for semantic segmentation, achieving state‑of‑the‑art results with significantly reduced reliance on auxiliary data.

Abstract:
Given the inherently costly and time‑intensive nature of pixel‑level annotation, the generation of synthetic datasets comprising sufficiently diverse synthetic images paired with ground‑truth pixel‑level annotations has garnered increasing attention recently for training high‑performance semantic segmentation models. However, existing methods require either predicting pseudo annotations after image generation or generating images conditioned on manual annotation masks, which incurs image‑annotation semantic inconsistency or scalability problems, respectively. To mitigate both problems at once, we present a novel dataset generative diffusion framework for semantic segmentation, termed JoDiffusion. First, given a standard latent diffusion model, JoDiffusion incorporates an independent annotation variational auto‑encoder (VAE) network to map annotation masks into the latent space shared by images. Then, the diffusion model is tailored to capture the joint distribution of each image and its annotation mask conditioned on a text prompt. In doing so, JoDiffusion enables simultaneously generating paired images and semantically consistent annotation masks solely conditioned on text prompts, thereby demonstrating superior scalability. Additionally, a mask optimization strategy is developed to mitigate the annotation noise produced during generation. Experiments on Pascal VOC, COCO, and ADE20K datasets show that the annotated dataset generated by JoDiffusion yields substantial performance improvements in semantic segmentation compared to existing methods.

Abstract:
Accurate medical image analysis can greatly assist clinical diagnosis, but its effectiveness relies on high‑quality expert annotations. Obtaining pixel‑level labels for medical images, particularly fundus images, remains costly and time‑consuming. Meanwhile, despite the success of deep learning in medical imaging, the lack of interpretability limits its clinical adoption. To address these challenges, we propose TWLR, a two‑stage framework for interpretable diabetic retinopathy (DR) assessment. In the first stage, a vision‑language model integrates domain‑specific ophthalmological knowledge into text embeddings to jointly perform DR grading and lesion classification, effectively linking semantic medical concepts with visual features. The second stage introduces an iterative severity regression framework based on weakly‑supervised semantic segmentation. Lesion saliency maps generated through iterative refinement direct a progressive inpainting mechanism that systematically eliminates pathological features, effectively downgrading disease severity toward healthier fundus appearances. Critically, this severity regression approach achieves dual benefits: accurate lesion localization without pixel‑level supervision and an interpretable visualization of disease‑to‑healthy transformations. Experimental results on the FGADR, DDR, and a private dataset demonstrate that TWLR achieves competitive performance in both DR classification and lesion segmentation, offering a more explainable and annotation‑efficient solution for automated retinal image analysis.

Abstract:
Unsupervised domain adaptation (UDA) enables semantic segmentation models to generalize from a labeled source domain to an unlabeled target domain. However, existing UDA methods still struggle to bridge the domain gap due to cross‑domain contextual ambiguity, inconsistent feature representations, and class‑wise pseudo‑label noise. To address these challenges, we propose Omni‑level Masking for Unsupervised Domain Adaptation (OMUDA), a unified framework that introduces hierarchical masking strategies across distinct representation levels. Specifically, OMUDA comprises: 1) a Context‑Aware Masking (CAM) strategy that adaptively distinguishes foreground from background to balance global context and local details; 2) a Feature Distillation Masking (FDM) strategy that enhances robust and consistent feature learning through knowledge transfer from pre‑trained models; and 3) a Class Decoupling Masking (CDM) strategy that mitigates the impact of noisy pseudo‑labels by explicitly modeling class‑wise uncertainty. This hierarchical masking paradigm effectively reduces the domain shift at the contextual, representational, and categorical levels, providing a unified solution beyond existing approaches. Extensive experiments on multiple challenging cross‑domain semantic segmentation benchmarks validate the effectiveness of OMUDA. Notably, on the SYNTHIA‑>Cityscapes and GTA5‑>Cityscapes tasks, OMUDA can be seamlessly integrated into existing UDA methods and consistently achieves state‑of‑the‑art results with an average improvement of 7%.

Abstract:
After a wildfire, delineating burned areas (BAs) is crucial for quantifying damages and supporting ecosystem recovery. Current BA mapping approaches rely on computer vision models trained on post‑event remote sensing imagery, but often overlook their applicability to time‑constrained emergency management scenarios. This study introduces a supervised semantic segmentation workflow aimed at boosting both the performance and efficiency of BA delineation. It targets SPOT‑6/7 imagery due to its very high resolution and on‑demand availability. Experiments are evaluated based on Dice score, Intersection over Union, and inference time. The results show that U‑Net and SegFormer models perform similarly with limited training data. However, SegFormer requires more resources, challenging its practical use in emergencies. Incorporating land cover data as an auxiliary task enhances model robustness without increasing inference time. Lastly, Test‑Time Augmentation improves BA delineation performance but raises inference time, which can be mitigated with optimization methods like Mixed Precision.
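
The two accuracy metrics named in this abstract, Dice score and Intersection over Union (IoU), can be illustrated with a minimal sketch. The function name `dice_and_iou`, the nested-list mask format, and the "two empty masks score 1.0" convention are assumptions made for illustration; they are not taken from the paper.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two binary masks given as nested lists of 0/1."""
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1      # burned in both prediction and reference
            elif p:
                fp += 1      # predicted burned, reference unburned
            elif t:
                fn += 1      # missed burned pixel
    dice_denom = 2 * tp + fp + fn
    iou_denom = tp + fp + fn
    # Assumed convention: two empty masks count as a perfect match.
    dice = 2 * tp / dice_denom if dice_denom else 1.0
    iou = tp / iou_denom if iou_denom else 1.0
    return dice, iou


# Toy 2x3 masks (hypothetical data).
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
dice, iou = dice_and_iou(pred, target)
```

For a single pair of masks the two metrics are tied by Dice = 2·IoU / (1 + IoU), so they rank individual predictions identically; they can diverge once averaged over many images.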

Abstract:
This extended abstract details our solution for the Global Wheat Full Semantic Segmentation Competition. We developed a systematic self‑training framework. This framework combines a two‑stage hybrid training strategy with extensive data augmentation. Our core model is SegFormer with a Mix Transformer (MiT‑B4) backbone. We employ an iterative teacher‑student loop. This loop progressively refines model accuracy. It also maximizes data utilization. Our method achieved competitive performance. This was evident on both the Development and Testing Phase datasets.

Abstract:
3D teeth segmentation, involving the localization of tooth instances and their semantic categorization in 3D dental models, is a critical yet challenging task in digital dentistry due to the complexity of real‑world dentition. In this paper, we propose 3DTeethSAM, an adaptation of the Segment Anything Model 2 (SAM2) for 3D teeth segmentation. SAM2 is a pretrained foundation model for image and video segmentation, demonstrating a strong backbone in various downstream scenarios. To adapt SAM2 for 3D teeth data, we render images of 3D teeth models from predefined views, apply SAM2 for 2D segmentation, and reconstruct 3D results using 2D‑3D projections. Since SAM2's performance depends on input prompts and its initial outputs often have deficiencies, and given its class‑agnostic nature, we introduce three light‑weight learnable modules: (1) a prompt embedding generator to derive prompt embeddings from image embeddings for accurate mask decoding, (2) a mask refiner to enhance SAM2's initial segmentation results, and (3) a mask classifier to categorize the generated masks. Additionally, we incorporate Deformable Global Attention Plugins (DGAP) into SAM2's image encoder. The DGAP enhances both the segmentation accuracy and the speed of the training process. Our method has been validated on the 3DTeethSeg benchmark, achieving an IoU of 91.90% on high‑resolution 3D teeth meshes, establishing a new state‑of‑the‑art in the field.

Abstract:
Recent advances in self‑supervised learning (SSL) have shown tremendous potential for learning 3D point cloud representations without human annotations. However, SSL for 3D point clouds still faces critical challenges due to irregular geometry, shortcut‑prone reconstruction, and unbalanced semantics distribution. In this work, we propose DOS (Distilling Observable Softmaps), a novel SSL framework that self‑distills semantic relevance softmaps only at observable (unmasked) points. This strategy prevents information leakage from masked regions and provides richer supervision than discrete token‑to‑prototype assignments. To address the challenge of unbalanced semantics in an unsupervised setting, we introduce Zipfian prototypes and incorporate them using a modified Sinkhorn‑Knopp algorithm, Zipf‑Sinkhorn, which enforces a power‑law prior over prototype usage and modulates the sharpness of the target softmap during training. DOS outperforms current state‑of‑the‑art methods on semantic segmentation and 3D object detection across multiple benchmarks, including nuScenes, Waymo, SemanticKITTI, ScanNet, and ScanNet200, without relying on extra data or annotations. Our results demonstrate that observable‑point softmaps distillation offers a scalable and effective paradigm for learning robust 3D representations.
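
As a rough illustration of the idea of imposing a power-law prior on prototype usage, the sketch below runs plain Sinkhorn-Knopp scaling with Zipf-shaped column marginals. It is not the paper's Zipf‑Sinkhorn: the function names, the exponent `s`, the temperature `eps`, and the toy scores are all assumptions, and the sharpness modulation mentioned in the abstract is not modeled.

```python
import math


def zipf_marginal(k, s=1.0):
    """Target usage for k prototypes, proportional to 1 / rank**s."""
    weights = [1.0 / (j + 1) ** s for j in range(k)]
    total = sum(weights)
    return [w / total for w in weights]


def sinkhorn(scores, col_marginal, eps=1.0, n_iters=50):
    """Soft-assign n points to k prototypes so that average prototype
    usage matches col_marginal and every point's assignment sums to 1."""
    n, k = len(scores), len(col_marginal)
    q = [[math.exp(scores[i][j] / eps) for j in range(k)] for i in range(n)]
    for _ in range(n_iters):
        # Scale columns so prototype j receives total mass col_marginal[j].
        for j in range(k):
            col_sum = sum(q[i][j] for i in range(n))
            for i in range(n):
                q[i][j] *= col_marginal[j] / col_sum
        # Scale rows so every point carries mass 1 / n.
        for i in range(n):
            row_sum = sum(q[i])
            for j in range(k):
                q[i][j] *= (1.0 / n) / row_sum
    # Rescale so each row is a probability distribution over prototypes.
    return [[n * v for v in row] for row in q]


# Hypothetical point-to-prototype similarity scores (4 points, 3 prototypes).
scores = [[0.9, 0.1, 0.2],
          [0.8, 0.3, 0.1],
          [0.2, 0.7, 0.4],
          [0.1, 0.2, 0.9]]
marginal = zipf_marginal(3)
soft = sinkhorn(scores, marginal)
usage = [sum(row[j] for row in soft) / len(soft) for j in range(3)]
```

With uniform column marginals this reduces to the balanced assignment commonly used in clustering-based self-supervised learning; the Zipf-shaped marginal instead lets a few prototypes absorb most points, which is one way to reflect long-tailed semantics.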

Abstract:
Deep neural networks achieve superior performance in semantic segmentation, but are limited to a predefined set of classes, which leads to failures when they encounter unknown objects in open‑world scenarios. Recognizing and segmenting these out‑of‑distribution (OOD) objects is crucial for safety‑critical applications such as automated driving. In this work, we present an evidence segmentation framework using a Wasserstein loss, which captures distributional distances while respecting the probability simplex geometry. Combined with Kullback‑Leibler regularization and Dice structural consistency terms, our approach leads to improved OOD segmentation performance compared to uncertainty‑based approaches.

Abstract:
We introduce a fully automatic pipeline for dynamic scene reconstruction from casually captured monocular RGB videos. Rather than designing a new scene representation, we enhance the priors that drive Dynamic Gaussian Splatting. Video segmentation combined with epipolar‑error maps yields object‑level masks that closely follow thin structures; these masks (i) guide an object‑depth loss that sharpens the consistent video depth, and (ii) support skeleton‑based sampling plus mask‑guided re‑identification to produce reliable, comprehensive 2‑D tracks. Two additional objectives embed the refined priors in the reconstruction stage: a virtual‑view depth loss removes floaters, and a scaffold‑projection loss ties motion nodes to the tracks, preserving fine geometry and coherent motion. The resulting system surpasses previous monocular dynamic scene reconstruction methods and delivers visibly superior renderings.

Abstract:
Forecasting from partial observations is central to world modeling. Many recent methods represent the world through images, and reduce forecasting to stochastic video generation. Although such methods excel at realism and visual fidelity, predicting pixels is computationally intensive and not directly useful in many applications, as it requires translating RGB into signals useful for decision making. An alternative approach uses features from vision foundation models (VFMs) as world representations, performing deterministic regression to predict future world states. These features can be directly translated into actionable signals such as semantic segmentation and depth, while remaining computationally efficient. However, deterministic regression averages over multiple plausible futures, undermining forecast accuracy by failing to capture uncertainty. To address this crucial limitation, we introduce a generative forecaster that performs autoregressive flow matching in VFM feature space. Our key insight is that generative modeling in this space requires encoding VFM features into a compact latent space suitable for diffusion. We show that this latent space preserves information more effectively than previously used PCA‑based alternatives, both for forecasting and other applications, such as image generation. Our latent predictions can be easily decoded into multiple useful and interpretable output modalities: semantic segmentation, depth, surface normals, and even RGB. With matched architecture and compute, our method produces sharper and more accurate predictions than regression across all modalities. Our results suggest that stochastic conditional generation of VFM features offers a promising and scalable foundation for future world models.

Abstract:
This paper proposes a large‑scale multi‑modal dataset for referring motion expression video segmentation, focusing on segmenting and tracking target objects in videos based on language description of objects' motions. Existing referring video segmentation datasets often focus on salient objects and use language expressions rich in static attributes, potentially allowing the target object to be identified in a single frame. Such datasets underemphasize the role of motion in both videos and languages. To explore the feasibility of using motion expressions and motion reasoning clues for pixel‑level video understanding, we introduce MeViS, a dataset containing 33,072 human‑annotated motion expressions in both text and audio, covering 8,171 objects in 2,006 videos of complex scenarios. We benchmark 15 existing methods across 4 tasks supported by MeViS, including 6 referring video object segmentation (RVOS) methods, 3 audio‑guided video object segmentation (AVOS) methods, 2 referring multi‑object tracking (RMOT) methods, and 4 video captioning methods for the newly introduced referring motion expression generation (RMEG) task. The results demonstrate weaknesses and limitations of existing methods in addressing motion expression‑guided video understanding. We further analyze the challenges and propose an approach LMPM++ for RVOS/AVOS/RMOT that achieves new state‑of‑the‑art results. Our dataset provides a platform that facilitates the development of motion expression‑guided video understanding algorithms in complex video scenes. The proposed MeViS dataset and the method's source code are publicly available at https://henghuiding.com/MeViS/

Abstract:
Few‑shot semantic segmentation (FSS) aims to segment novel classes in query images using only a small annotated support set. While prior research has mainly focused on improving decoders, the encoder's limited ability to extract meaningful features for unseen classes remains a key bottleneck. In this work, we introduce Take a Peek (TaP), a simple yet effective method that enhances encoder adaptability for both FSS and cross‑domain FSS (CD‑FSS). TaP leverages Low‑Rank Adaptation (LoRA) to fine‑tune the encoder on the support set with minimal computational overhead, enabling fast adaptation to novel classes while mitigating catastrophic forgetting. Our method is model‑agnostic and can be seamlessly integrated into existing FSS pipelines. Extensive experiments across multiple benchmarks, including COCO 20^i, Pascal 5^i, and cross‑domain datasets such as DeepGlobe, ISIC, and Chest X‑ray, demonstrate that TaP consistently improves segmentation performance across diverse models and shot settings. Notably, TaP delivers significant gains in complex multi‑class scenarios, highlighting its practical effectiveness in realistic settings. A rank sensitivity analysis also shows that strong performance can be achieved even with low‑rank adaptations, ensuring computational efficiency. By addressing a critical limitation in FSS, namely the encoder's generalization to novel classes, TaP paves the way toward more robust, efficient, and generalizable segmentation systems. The code is available at https://github.com/pasqualedem/TakeAPeek.
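
The Low‑Rank Adaptation (LoRA) mechanism this abstract relies on can be sketched for a single linear layer as follows. This shows the generic LoRA update only, not the TaP code: the helper names, the row-vector weight layout, and the toy numbers are assumptions for illustration.

```python
def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def lora_forward(x, w, a, b, alpha):
    """Compute x @ w + (alpha / r) * (x @ a) @ b.

    w (d_in x d_out) stays frozen; only the low-rank factors
    a (d_in x r) and b (r x d_out) would be trained.
    """
    r = len(b)  # rank of the adapter
    base = matmul(x, w)
    delta = matmul(matmul(x, a), b)
    scale = alpha / r
    return [[bv + scale * dv for bv, dv in zip(b_row, d_row)]
            for b_row, d_row in zip(base, delta)]


# Toy layer with d_in = d_out = 2 and rank r = 1 (hypothetical values).
x = [[1.0, 2.0]]
w = [[1.0, 0.0],
     [0.0, 1.0]]
a = [[0.5],
     [-0.5]]
b_zero = [[0.0, 0.0]]      # common initialization: adapter starts as a no-op
b_trained = [[1.0, 2.0]]   # pretend values after a few update steps

y_init = lora_forward(x, w, a, b_zero, alpha=2.0)
y_adapted = lora_forward(x, w, a, b_trained, alpha=2.0)
```

Because `b` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and the number of trainable values grows with the rank `r` rather than with the full weight size.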

Abstract:
Weakly supervised semantic segmentation (WSSS) in histopathology relies heavily on classification backbones, yet these models often localize only the most discriminative regions and struggle to capture the full spatial extent of tissue structures. Vision‑language models such as CONCH offer rich semantic alignment and morphology‑aware representations, while modern segmentation backbones like SegFormer preserve fine‑grained spatial cues. However, combining these complementary strengths remains challenging, especially under weak supervision and without dense annotations. We propose a prototype learning framework for WSSS in histopathological images that integrates morphology‑aware representations from CONCH, multi‑scale structural cues from SegFormer, and text‑guided semantic alignment to produce prototypes that are simultaneously semantically discriminative and spatially coherent. To effectively leverage these heterogeneous sources, we introduce text‑guided prototype initialization that incorporates pathology descriptions to generate more complete and semantically accurate pseudo‑masks. A structural distillation mechanism transfers spatial knowledge from SegFormer to preserve fine‑grained morphological patterns and local tissue boundaries during prototype learning. Our approach produces high‑quality pseudo masks without pixel‑level annotations, improves localization completeness, and enhances semantic consistency across tissue types. Experiments on BCSS‑WSSS datasets demonstrate that our prototype learning framework outperforms existing WSSS methods while remaining computationally efficient through frozen foundation model backbones and lightweight trainable adapters.

Abstract:
Weakly supervised semantic segmentation (WSSS) in histopathology seeks to reduce annotation cost by learning from image‑level labels, yet it remains limited by inter‑class homogeneity, intra‑class heterogeneity, and the region‑shrinkage effect of CAM‑based supervision. We propose a simple and effective prototype‑driven framework that leverages vision‑language alignment to improve region discovery under weak supervision. Our method integrates CoOp‑style learnable prompt tuning to generate text‑based prototypes and combines them with learnable image prototypes, forming a dual‑modal prototype bank that captures both semantic and appearance cues. To address oversmoothing in ViT representations, we incorporate a multi‑scale pyramid module that enhances spatial precision and improves localization quality. Experiments on the BCSS‑WSSS benchmark show that our approach surpasses existing state‑of‑the‑art methods, and detailed analyses demonstrate the benefits of text description diversity, context length, and the complementary behavior of text and image prototypes. These results highlight the effectiveness of jointly leveraging textual semantics and visual prototype learning for WSSS in digital pathology.

Abstract:
We present NordFKB, a fine‑grained benchmark dataset for geospatial AI in Norway, derived from the authoritative, highly accurate, national Felles KartdataBase (FKB). The dataset contains high‑resolution orthophotos paired with detailed annotations for 36 semantic classes, including both per‑class binary segmentation masks in GeoTIFF format and COCO‑style bounding box annotations. Data is collected from seven geographically diverse areas, ensuring variation in climate, topography, and urbanization. Only tiles containing at least one annotated object are included, and training/validation splits are created through random sampling across areas to ensure representative class and context distributions. Human expert review and quality control ensure high annotation accuracy. Alongside the dataset, we release a benchmarking repository with standardized evaluation protocols and tools for semantic segmentation and object detection, enabling reproducible and comparable research. NordFKB provides a robust foundation for advancing AI methods in mapping, land administration, and spatial planning, and paves the way for future expansions in coverage, temporal scope, and data modalities.

Abstract:
Class‑agnostic 3D instance segmentation tackles the challenging task of segmenting all object instances, including previously unseen ones, without semantic class reliance. Current methods struggle with generalization due to scarce annotated 3D scene data or noisy 2D segmentations. While synthetic data generation offers a promising solution, existing 3D scene synthesis methods fail to simultaneously satisfy geometry diversity, context complexity, and layout reasonability, each essential for this task. To address these needs, we propose an Adapted 3D Scene Synthesis pipeline for class‑agnostic 3D Instance SegmenTation, termed ASSIST‑3D, to synthesize proper data for model generalization enhancement. Specifically, ASSIST‑3D features three key innovations: 1) Heterogeneous Object Selection from extensive 3D CAD asset collections, incorporating randomness in object sampling to maximize geometric and contextual diversity; 2) Scene Layout Generation through LLM‑guided spatial reasoning combined with depth‑first search for reasonable object placements; and 3) Realistic Point Cloud Construction via multi‑view RGB‑D image rendering and fusion from the synthetic scenes, closely mimicking real‑world sensor data acquisition. Experiments on ScanNetV2, ScanNet++, and S3DIS benchmarks demonstrate that models trained with ASSIST‑3D‑generated data significantly outperform existing methods. Further comparisons underscore the superiority of our purpose‑built pipeline over existing 3D scene synthesis approaches.

Abstract:
This paper introduces ROI‑Packing, an efficient image compression method tailored specifically for machine vision. By prioritizing regions of interest (ROI) critical to end‑task accuracy and packing them efficiently while discarding less relevant data, ROI‑Packing achieves significant compression efficiency without requiring retraining or fine‑tuning of end‑task models. Comprehensive evaluations across five datasets and two popular tasks (object detection and instance segmentation) demonstrate up to a 44.10% reduction in bitrate without compromising end‑task accuracy, along with an 8.88% improvement in accuracy at the same bitrate compared to the state‑of‑the‑art Versatile Video Coding (VVC) codec standardized by the Moving Picture Experts Group (MPEG).

Abstract:
Accurate 3D scene interpretation in active construction sites is essential for progress monitoring, safety assessment, and digital twin development. LiDAR is widely used in construction because it offers advantages over camera‑based systems, performing reliably in cluttered and dynamically changing conditions. Yet most public datasets for 3D perception are derived from densely fused scans with uniform sampling and complete visibility, conditions that do not reflect real construction sites. Field data are often collected as isolated single‑station LiDAR views, constrained by safety requirements, limited access, and ongoing operations. These factors lead to radial density decay, fragmented geometry, and view‑dependent visibility, characteristics that remain underrepresented in existing datasets. This paper presents SIP, Site in Pieces, a dataset created to reflect the practical constraints of LiDAR acquisition during construction. SIP provides indoor and outdoor scenes captured with a terrestrial LiDAR scanner and annotated at the point level using a taxonomy tailored to construction environments: A. Built Environment, B. Construction Operations, and C. Site Surroundings. The dataset includes both structural components and slender temporary objects such as scaffolding, MEP piping, and scissor lifts, where sparsity caused by occlusion and fragmented geometry make segmentation particularly challenging. The scanning protocol, annotation workflow, and quality control procedures establish a consistent foundation for the dataset. SIP is openly available with a supporting Git repository, offering adaptable class configurations that streamline adoption within modern 3D deep learning frameworks. By providing field data that retain real‑world sensing characteristics, SIP enables robust benchmarking and contributes to advancing construction‑oriented 3D vision tasks.

Abstract:
Continual Test‑Time Adaptation (CTTA) enables pre‑trained models to adapt to continuously evolving domains. Existing methods have improved robustness but typically rely on fixed or batch‑level thresholds, which cannot account for varying difficulty across classes and instances. This limitation is especially problematic in semantic segmentation, where each image requires dense, multi‑class predictions. We propose an approach that adaptively adjusts pseudo labels to reflect the confidence distribution within each image and dynamically balances learning toward classes most affected by domain shifts. This fine‑grained, class‑ and instance‑aware adaptation produces more reliable supervision and mitigates error accumulation throughout continual adaptation. Extensive experiments across eight CTTA and TTA scenarios, including synthetic‑to‑real and long‑term shifts, show that our method consistently outperforms state‑of‑the‑art techniques, setting a new standard for semantic segmentation under evolving conditions.

Abstract:
Few‑shot 3D point cloud semantic segmentation (FS‑3DSeg) aims to segment novel classes with only a few labeled samples. However, existing metric‑based prototype learning methods generate prototypes solely from the support set, without considering their relevance to query data. This often results in prototype bias, where prototypes overfit support‑specific characteristics and fail to generalize to the query distribution, especially in the presence of distribution shifts, which leads to degraded segmentation performance. To address this issue, we propose a novel Query‑aware Hub Prototype (QHP) learning method that explicitly models semantic correlations between support and query sets. Specifically, we propose a Hub Prototype Generation (HPG) module that constructs a bipartite graph connecting query and support points, identifies frequently linked support hubs, and generates query‑relevant prototypes that better capture cross‑set semantics. To further mitigate the influence of bad hubs and ambiguous prototypes near class boundaries, we introduce a Prototype Distribution Optimization (PDO) module, which employs a purity‑reweighted contrastive loss to refine prototype representations by pulling bad hubs and outlier prototypes closer to their corresponding class centers. Extensive experiments on S3DIS and ScanNet demonstrate that QHP achieves substantial performance gains over state‑of‑the‑art methods, effectively narrowing the semantic gap between prototypes and query sets in FS‑3DSeg.
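
One way to picture the bipartite‑graph hub idea in this abstract is the toy sketch below: every query point links to its nearest support points, support points with many incoming links are treated as hubs, and a prototype is averaged from the hubs. This is a simplified illustration under assumed choices (squared Euclidean distance, `k`, `min_links`, fallback to all support points), not the paper's HPG module.

```python
def hub_prototype(support, query, k=1, min_links=2):
    """Return (hub indices, prototype) from query-to-support k-NN links."""
    counts = [0] * len(support)
    for q in query:
        # Rank support points by squared Euclidean distance to this query point.
        ranked = sorted(
            range(len(support)),
            key=lambda i: sum((s - v) ** 2 for s, v in zip(support[i], q)),
        )
        for i in ranked[:k]:
            counts[i] += 1  # one bipartite edge: query point -> support point
    hubs = [i for i, c in enumerate(counts) if c >= min_links]
    if not hubs:
        # Assumed fallback: with no hub, use every support point.
        hubs = list(range(len(support)))
    dim = len(support[0])
    prototype = [sum(support[i][d] for i in hubs) / len(hubs)
                 for d in range(dim)]
    return hubs, prototype


# Hypothetical 2-D features for three support and three query points.
support = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
query = [[0.9, 1.1], [1.2, 0.8], [0.1, 0.0]]
hubs, prototype = hub_prototype(support, query)
```

In this toy case the distant support point at (5, 5) receives no links and does not influence the prototype, which hints at how query‑aware selection can reduce the support‑specific bias described above.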

Abstract:
Accurate understanding of anatomical structures is essential for reliably staging certain dental diseases. One way of introducing this within semantic segmentation models is by utilising hierarchy‑aware methodologies. However, existing hierarchy‑aware segmentation methods largely encode anatomical structure through the loss functions, providing weak and indirect supervision. We introduce a general framework that embeds an explicit anatomical hierarchy into semantic segmentation by coupling a recurrent, level‑wise prediction scheme with restrictive output heads and top‑down feature conditioning. At each depth of the class tree, the backbone is re‑run on the original image concatenated with logits from the previous level. Child class features are conditioned on their parent class probabilities via Feature‑wise Linear Modulation (FiLM), which modulates child feature spaces for fine‑grained detection. A probabilistic composition rule enforces consistency between parent and descendant classes. The hierarchical loss combines per‑level class‑weighted Dice and cross‑entropy losses with a consistency term that ensures parent predictions equal the sum of their children. We validate our approach on our proposed dataset, TL‑pano, containing 194 panoramic radiographs with dense instance and semantic segmentation annotations of tooth layers and alveolar bone. Utilising UNet and HRNet as donor models across a 5‑fold cross‑validation scheme, the hierarchical variants consistently increase IoU, Dice, and recall, particularly for fine‑grained anatomies, and produce more anatomically coherent masks. However, hierarchical variants also demonstrated increased recall over precision, implying increased false positives. The results demonstrate that explicit hierarchical structuring improves both performance and clinical plausibility, especially in low‑data dental imaging regimes.
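
The two mechanisms named in this abstract, FiLM conditioning and a parent‑child probability composition, can be sketched per pixel as below. The scalar FiLM parameters, the sibling softmax, and the class names in the toy tree are assumptions for illustration and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def film(features, parent_prob, w_gamma, b_gamma, w_beta, b_beta):
    """FiLM: scale and shift features with gamma and beta predicted
    (here by assumed scalar linear maps) from the parent probability."""
    gamma = w_gamma * parent_prob + b_gamma
    beta = w_beta * parent_prob + b_beta
    return [gamma * f + beta for f in features]


def compose(parent_probs, child_logits):
    """P(child) = P(parent) * P(child | parent), so the children of a
    parent always sum to that parent's probability."""
    out = {}
    for parent, p_parent in parent_probs.items():
        names = list(child_logits[parent])
        cond = softmax([child_logits[parent][name] for name in names])
        for name, p in zip(names, cond):
            out[name] = p_parent * p
    return out


# Hypothetical two-level class tree for a single pixel.
parent_probs = {"tooth": 0.7, "bone": 0.3}
child_logits = {
    "tooth": {"enamel": 2.0, "dentin": 1.0, "pulp": 0.0},
    "bone": {"alveolar_bone": 0.0},
}
child_probs = compose(parent_probs, child_logits)
modulated = film([1.0, 2.0], parent_prob=0.5,
                 w_gamma=2.0, b_gamma=0.0, w_beta=0.0, b_beta=1.0)
```

Building child probabilities this way satisfies the sum constraint by construction; a consistency loss like the one described in the abstract is needed when parent and child heads are predicted independently.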

Abstract:
The recent SAM 3 and SAM 3D have introduced significant advancements over the predecessor, SAM 2, particularly with the integration of language‑based segmentation and enhanced 3D perception capabilities. SAM 3 supports zero‑shot segmentation across a wide range of prompts, including point, bounding box, and language‑based prompts, allowing for more flexible and intuitive interactions with the model. In this empirical evaluation, we assess the performance of SAM 3 in robot‑assisted surgery, benchmarking its zero‑shot segmentation with point and bounding box prompts and exploring its effectiveness in dynamic video tracking, alongside its newly introduced language prompt segmentation. While language prompts show potential, their performance in the surgical domain is currently suboptimal, highlighting the need for further domain‑specific training. Additionally, we investigate SAM 3D's depth reconstruction abilities, demonstrating its capacity to process surgical scene data and reconstruct 3D anatomical structures from 2D images. Through comprehensive testing on the MICCAI EndoVis 2017 and EndoVis 2018 benchmarks, SAM 3 shows clear improvements over SAM and SAM 2 in both image and video segmentation under spatial prompts, while the zero‑shot evaluations of SAM 3D on SCARED, StereoMIS, and EndoNeRF indicate strong monocular depth estimation and realistic 3D instrument reconstruction, yet also reveal remaining limitations in complex, highly dynamic surgical scenes.

Abstract:
Benefiting from the inductive biases learned from large‑scale datasets, open‑vocabulary semantic segmentation (OVSS) leverages the power of vision‑language models, such as CLIP, to achieve remarkable progress without requiring task‑specific training. However, due to CLIP's pre‑training nature on image‑text pairs, it tends to focus on global semantic alignment, resulting in suboptimal performance when associating fine‑grained visual regions with text. This leads to noisy and inconsistent predictions, particularly in local areas. We attribute this to a dispersed bias stemming from its contrastive training paradigm, which is difficult to alleviate using CLIP features alone. To address this, we propose a structure‑aware feature rectification approach that incorporates instance‑specific priors derived directly from the image. Specifically, we construct a region adjacency graph (RAG) based on low‑level features (e.g., colour and texture) to capture local structural relationships and use it to refine CLIP features by enhancing local discrimination. Extensive experiments show that our method effectively suppresses segmentation noise, improves region‑level consistency, and achieves strong performance on multiple open‑vocabulary segmentation benchmarks.
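
A region adjacency graph of the kind mentioned in this abstract can be sketched as follows: regions that touch in a label map become graph neighbours, edges are weighted by low‑level colour similarity, and each region's feature is blended with those of similar‑looking neighbours. The Gaussian‑style weight, `sigma`, the scalar colours, and the blending rule are assumptions; the paper's actual rectification step is not specified here.

```python
import math


def build_rag(labels, colours, sigma=0.1):
    """Return {(a, b): weight} for 4-connected regions a < b, weighted by
    similarity of their mean colours (scalar intensities in this sketch)."""
    edges = {}
    h, w = len(labels), len(labels[0])
    for y in range(h):
        for x in range(w):
            a = labels[y][x]
            for ny, nx in ((y + 1, x), (y, x + 1)):
                if ny < h and nx < w and labels[ny][nx] != a:
                    b = labels[ny][nx]
                    key = (min(a, b), max(a, b))
                    edges[key] = math.exp(-abs(colours[a] - colours[b]) / sigma)
    return edges


def refine(features, edges):
    """Blend each region feature with its neighbours, weighted by edge
    similarity, so dissimilar neighbours contribute almost nothing."""
    acc = {r: list(f) for r, f in features.items()}
    norm = {r: 1.0 for r in features}
    for (a, b), wgt in edges.items():
        for dst, src in ((a, b), (b, a)):
            acc[dst] = [v + wgt * s for v, s in zip(acc[dst], features[src])]
            norm[dst] += wgt
    return {r: [v / norm[r] for v in acc[r]] for r in acc}


# Hypothetical 3x3 superpixel label map, mean intensities, and 2-D features.
labels = [[0, 0, 1],
          [0, 2, 1],
          [2, 2, 1]]
colours = {0: 0.10, 1: 0.90, 2: 0.15}
features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 0.0]}
rag = build_rag(labels, colours)
refined = refine(features, rag)
```

Regions 0 and 2 have similar colours and therefore share information, while region 1 stays nearly untouched, which keeps boundaries between visually distinct regions sharp.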

Abstract:
Virtual representations of physical critical infrastructures, such as water or energy plants, are used for simulations and digital twins to ensure resilience and continuity of their services. These models usually require 3D point clouds from laser scanners that are expensive to acquire and require specialist knowledge to use. In this article, we present a prototypical graph generation pipeline based on photogrammetry. The pipeline detects relevant objects and predicts their relations using RGB images and depth data generated by a stereo camera. This more cost‑effective approach uses deep learning for object detection and instance segmentation of the objects, and employs user‑defined heuristics or rules to infer their relations. Results on two hydraulic systems show that this strategy can produce graphs close to the ground truth. While this study focuses on hydraulic systems, the general process can be used to tailor the method to other types of infrastructures and applications. The user‑defined rules create transparency, qualifying the pipeline for use in the high‑stakes decision‑making that is required for critical infrastructures.

Abstract:
Glass is a prevalent material among solid objects in everyday life, yet segmentation methods struggle to distinguish it from opaque materials due to its transparency and reflection. While it is known that human perception relies on boundary and reflective‑object features to distinguish glass objects, the existing literature has not yet sufficiently captured both properties when handling transparent objects. Hence, we propose incorporating both of these powerful visual cues via the Boundary Feature Enhancement and Reflection Feature Enhancement modules in a mutually beneficial way. Our proposed framework, TransCues, is a pyramidal transformer encoder‑decoder architecture to segment transparent objects. We empirically show that these two modules can be used together effectively, improving overall performance across various benchmark datasets, including glass object semantic segmentation, mirror object semantic segmentation, and generic segmentation datasets. Our method outperforms the state‑of‑the‑art by a large margin, achieving +4.2% mIoU on Trans10K‑v2, +5.6% mIoU on MSD, +10.1% mIoU on RGBD‑Mirror, +13.1% mIoU on TROSD, and +8.3% mIoU on Stanford2D3D, showing the effectiveness of our method against glass objects.

Abstract:
This paper proposes a novel self‑supervised learning method for semantic segmentation using selective masking image reconstruction as the pretraining task. Our proposed method replaces the random masking augmentation used in most masked image modelling pretraining methods. The proposed selective masking method selectively masks image patches with the highest reconstruction loss by breaking the image reconstruction pretraining into iterative steps to leverage the trained model's knowledge. We show on two general datasets (Pascal VOC and Cityscapes) and two weed segmentation datasets (Nassar 2020 and Sugarbeets 2016) that our proposed selective masking method outperforms the traditional random masking method and supervised ImageNet pretraining on downstream segmentation accuracy by 2.9% for general datasets and 2.5% for weed segmentation datasets. Furthermore, we found that our selective masking method significantly improves accuracy for the lowest‑performing classes. Lastly, we show that using the same pretraining and downstream dataset yields the best result for low‑budget self‑supervised pretraining. Our proposed Selective Masking Image Reconstruction method provides an effective and practical solution to improve end‑to‑end semantic segmentation workflows, especially for scenarios that require limited model capacity to meet inference speed and computational resource requirements.
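
The core selection step described above, masking the patches with the highest reconstruction loss instead of a random subset, can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `select_mask_indices`, the `mask_ratio` parameter, and the idea that per-patch losses arrive as a flat list from the previous iteration are all hypothetical, not the authors' implementation.

```python
def select_mask_indices(patch_losses, mask_ratio=0.5):
    """Return the indices of the patches to mask in the next iteration.

    Patches are ranked by the reconstruction loss measured in the previous
    iteration, and the hardest `mask_ratio` fraction is selected.
    """
    num_masked = int(mask_ratio * len(patch_losses))
    ranked = sorted(range(len(patch_losses)),
                    key=lambda i: patch_losses[i],
                    reverse=True)
    return sorted(ranked[:num_masked])


if __name__ == "__main__":
    losses = [0.1, 0.9, 0.4, 0.7]  # hypothetical per-patch losses
    print(select_mask_indices(losses, mask_ratio=0.5))
```

A random-masking baseline would instead sample `num_masked` indices uniformly, ignoring the losses.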

Abstract:
Reliable 3D segmentation is critical for understanding complex scenes with dense layouts and multi‑scale objects, as commonly seen in industrial environments. In such scenarios, heavy occlusion weakens geometric boundaries between objects, and large differences in object scale cause end‑to‑end models to fail to capture both coarse and fine details accurately. Existing 3D point‑based methods require costly annotations, while image‑guided methods often suffer from semantic inconsistencies across views. To address these challenges, we propose a hierarchical image‑guided 3D segmentation framework that progressively refines segmentation from instance‑level to part‑level. Instance segmentation involves rendering a top‑view image and projecting SAM‑generated masks prompted by YOLO‑World back onto the 3D point cloud. Part‑level segmentation is subsequently performed by rendering multi‑view images of each instance obtained from the previous stage and applying the same 2D segmentation and back‑projection process at each view, followed by Bayesian updating fusion to ensure semantic consistency across views. Experiments on real‑world factory data demonstrate that our method effectively handles occlusion and structural complexity, achieving consistently high per‑class mIoU scores. Additional evaluations on a public dataset confirm the generalization ability of our framework, highlighting its robustness, annotation efficiency, and adaptability to diverse 3D environments.
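
To make the Bayesian updating fusion step more concrete, the sketch below fuses per-view class scores for a single 3D point. It assumes the views are conditionally independent given the class and that each view supplies a likelihood vector over classes; the function name `bayesian_fuse` and the `eps` floor are illustrative choices, and the abstract does not state the exact update rule used.

```python
def bayesian_fuse(prior, view_likelihoods, eps=1e-6):
    """Fuse per-view class likelihoods for one 3D point with Bayes' rule.

    `prior` and every entry of `view_likelihoods` are lists with one value
    per class. Likelihoods are floored at `eps` so that a single view cannot
    permanently rule out a class.
    """
    posterior = list(prior)
    for likelihood in view_likelihoods:
        unnormalised = [p * max(l, eps) for p, l in zip(posterior, likelihood)]
        total = sum(unnormalised)
        posterior = [u / total for u in unnormalised]
    return posterior


if __name__ == "__main__":
    # Two classes, uniform prior, two views that both favour class 0.
    print(bayesian_fuse([0.5, 0.5], [[0.8, 0.2], [0.6, 0.4]]))
```

Because the update is multiplicative, views that agree reinforce each other, while a single inconsistent view is outweighed once enough consistent views have been seen.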

Abstract:
Federated Learning (FL) enables collaborative training of autonomous driving (AD) models across distributed vehicles while preserving data privacy. However, FL encounters critical challenges such as poor generalization and slow convergence due to non‑independent and identically distributed (non‑IID) data from diverse driving environments. To overcome these obstacles, we introduce Federated Deep Supervision and Regularization (FedDSR), a paradigm that incorporates multi‑access intermediate layer supervision and regularization within a federated AD system. Specifically, FedDSR comprises the following integral strategies: (I) select multiple intermediate layers based on predefined architecture‑agnostic standards. (II) compute mutual information (MI) and negative entropy (NE) on those selected layers to serve as intermediate loss and regularizer. These terms are integrated into the output‑layer loss to form a unified optimization objective, enabling comprehensive optimization across the network hierarchy. (III) aggregate the models that vehicles trained according to (I) and (II) to generate the global model on the central server. By guiding and penalizing the learning of feature representations at intermediate stages, FedDSR enhances model generalization and accelerates model convergence for federated AD. We then take the semantic segmentation task as an example to assess FedDSR and apply FedDSR to multiple model architectures and FL algorithms. Extensive experiments demonstrate that FedDSR achieves up to 8.93% improvement in mIoU and 28.57% reduction in training rounds, compared to other FL baselines, making it highly suitable for practical deployment in federated AD ecosystems.
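
The three strategies above can be pictured with a small sketch: a unified objective that adds intermediate-layer terms to the output loss, and a server-side aggregation step. Everything here is an assumption for illustration: the weights `mi_weight` and `ne_weight`, the treatment of the MI and NE terms as already-signed scalar losses, and the use of sample-size-weighted averaging (FedAvg-style) as the aggregation rule. The abstract does not give the exact formulation.

```python
def unified_loss(output_loss, mi_terms, ne_terms, mi_weight=0.1, ne_weight=0.01):
    """Combine the output-layer loss with intermediate-layer terms.

    `mi_terms` and `ne_terms` hold one scalar per selected intermediate
    layer, already signed so that smaller is better (an assumption).
    """
    return output_loss + mi_weight * sum(mi_terms) + ne_weight * sum(ne_terms)


def weighted_average(client_weights, client_sizes):
    """Aggregate flat parameter lists, weighting each client by its data size."""
    total = sum(client_sizes)
    num_params = len(client_weights[0])
    return [
        sum(w[i] * s for w, s in zip(client_weights, client_sizes)) / total
        for i in range(num_params)
    ]


if __name__ == "__main__":
    print(unified_loss(1.0, [0.5, 0.5], [1.0]))
    print(weighted_average([[1.0, 2.0], [3.0, 4.0]], [1, 3]))
```

Since the abstract reports applying FedDSR to multiple FL algorithms, the averaging rule shown here should be read as one possible choice rather than a fixed part of the method.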

Abstract:
Autonomous driving (AD) scenarios are inherently complex and diverse, posing significant challenges for a single deep learning model to effectively cover all possible conditions, such as varying weather, traffic densities, and road types. The Large Model (LM)‑driven Mixture of Experts (MoE) paradigm offers a promising solution, where the LM serves as the backbone to extract latent features while the MoE serves as the downstream head to dynamically select and aggregate specialized experts to adapt to different scenarios. However, routing and aggregating in MoE face intrinsic challenges, including imprecise expert selection due to flawed routing strategies and inefficient expert aggregation leading to suboptimal prediction. To address these issues, we propose a statistic‑augmented, decoupled MoE Routing and Aggregating Mechanism (MoE‑RAM) driven by an LM. Specifically, on the one hand, MoE‑RAM enhances expert routing by incorporating a statistical retrieval mechanism to match LM‑extracted latent features with cached prototypical features of the most relevant experts; on the other hand, MoE‑RAM adaptively reweights experts' outputs in fusion by measuring statistical distances of experts' instant features against LM‑extracted latent features. Benefiting from the synergy of the statistic‑augmented MoE's routing and aggregating, MoE‑RAM ultimately improves the prediction performance. We take the AD semantic segmentation task as an example to assess the proposed MoE‑RAM. Extensive experiments on AD datasets demonstrate the superiority of MoE‑RAM compared to other MoE baselines and conventional single‑model approaches.
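
A minimal sketch of the two decoupled steps described above, routing by prototype retrieval and distance-based reweighting, is given below. The similarity and distance measures are assumptions chosen for illustration (cosine similarity for retrieval, Euclidean distance with a softmax for reweighting), as are the function names and the expert names in the usage example; the abstract only says "statistical retrieval" and "statistical distances".

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def route(latent, prototypes, top_k=2):
    """Pick the experts whose cached prototype best matches the latent feature."""
    scores = {name: cosine(latent, proto) for name, proto in prototypes.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


def aggregate(latent, expert_features, expert_outputs, temperature=1.0):
    """Fuse expert outputs, giving more weight to experts closer to the latent."""
    names = list(expert_outputs)
    logits = [-math.dist(latent, expert_features[n]) / temperature for n in names]
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]  # numerically stable softmax
    weights = [e / sum(exps) for e in exps]
    dim = len(expert_outputs[names[0]])
    fused = [sum(w * expert_outputs[n][i] for w, n in zip(weights, names))
             for i in range(dim)]
    return fused, dict(zip(names, weights))


if __name__ == "__main__":
    prototypes = {"rain": [1.0, 0.0], "night": [0.0, 1.0], "highway": [0.7, 0.7]}
    latent = [1.0, 0.1]
    chosen = route(latent, prototypes, top_k=2)
    print(chosen)
```

Keeping routing and aggregation as separate functions mirrors the "decoupled" wording: the first decides which experts run, the second decides how much each one contributes.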

Abstract:
Large Vision‑Language Models (LVLMs) have demonstrated remarkable success in a broad range of vision‑language tasks, such as general visual question answering and optical character recognition (OCR). However, their performance on perception‑centric tasks ‑‑ such as object detection, semantic segmentation, and depth estimation ‑‑ remains significantly inferior to that of task‑specific expert models. For example, Qwen2.5‑VL‑7B‑Instruct achieves only 19% mAP on COCO2017 val, particularly struggling with dense scenes and small object recall. In this work, we introduce Chain‑of‑Thought for Detection (CoT4Det), a simple but efficient strategy that reformulates perception tasks into three interpretable steps: classification, counting, and grounding ‑‑ each more naturally aligned with the reasoning capabilities of LVLMs. Extensive experiments demonstrate that our method significantly improves perception performance without compromising general vision language capabilities. With a standard Qwen2.5‑VL‑7B‑Instruct, CoT4Det boosts mAP from 19.0% to 33.0% on COCO2017 val and achieves competitive results across a variety of perception benchmarks, outperforming baselines by +2% on RefCOCO series and 19% on Flickr30k entities.

Abstract:
Recent text‑to‑video models have enabled the generation of high‑resolution driving scenes from natural language prompts. These AI‑generated driving videos (AIGVs) offer a low‑cost, scalable alternative to real or simulator data for autonomous driving (AD). However, a key question remains: can such videos reliably support training and evaluation of AD models? We present a diagnostic framework that systematically studies this question. First, we introduce a taxonomy of frequent AIGV failure modes, including visual artifacts, physically implausible motion, and violations of traffic semantics, and demonstrate their negative impact on object detection, tracking, and instance segmentation. To support this analysis, we build ADGV‑Bench, a driving‑focused benchmark with human quality annotations and dense labels for multiple perception tasks. We then propose ADGVE, a driving‑aware evaluator that combines static semantics, temporal cues, lane obedience signals, and Vision‑Language Model (VLM)‑guided reasoning into a single quality score for each clip. Experiments show that blindly adding raw AIGVs can degrade perception performance, while filtering them with ADGVE consistently improves both general video quality assessment metrics and downstream AD models, and turns AIGVs into a beneficial complement to real‑world data. Our study highlights both the risks and the promise of AIGVs, and provides practical tools for safely leveraging large‑scale video generation in future AD pipelines.

Abstract:
Deep Neural Networks are vulnerable to small perturbations that can drastically alter their predictions for perceptually unchanged inputs. The literature on adversarially robust Deep Learning attempts to either enhance the robustness of neural networks (e.g., via adversarial training) or to certify their decisions up to a given robustness level (e.g., by using randomized smoothing, formal methods or Lipschitz bounds). These studies mostly focus on classification tasks and few efficient certification procedures currently exist for semantic segmentation. In this work, we introduce a new class of certifiably robust Semantic Segmentation networks with built‑in Lipschitz constraints that are efficiently trainable and achieve competitive pixel accuracy on challenging datasets such as Cityscapes. Additionally, we provide a novel framework that generalizes robustness certificates for semantic segmentation tasks, where we showcase the flexibility and computational efficiency of using Lipschitz networks. Our approach unlocks real‑time compatible certifiably robust semantic segmentation for the first time. Moreover, it allows the computation of worst‑case performance under $\ell_2$ attacks of radius ε across a wide range of performance measures. Crucially, we benchmark the runtime of our certification process and find our approach to be around 600 times faster than randomized smoothing methods at inference with comparable certificates on an NVIDIA A100 GPU. Finally, we evaluate the tightness of our worst‑case certificates against state‑of‑the‑art adversarial attacks to further validate the performance of our method.
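
For intuition on how a Lipschitz bound turns into a per-pixel certificate, the sketch below uses the standard margin-based argument: if the logit map is L-Lipschitz in the ℓ2 norm, a pixel's prediction cannot change under any perturbation whose norm is below margin / (√2 · L), where the margin is the gap between the top two logits. This is the generic classification bound applied pixel by pixel, offered as an assumption-laden illustration; it is not necessarily the certificate derived in the paper, and the function names are hypothetical.

```python
import math


def certified_radius(logits, lipschitz_const=1.0):
    """L2 radius within which the arg-max of `logits` provably cannot change."""
    top, runner_up = sorted(logits, reverse=True)[:2]
    return (top - runner_up) / (math.sqrt(2.0) * lipschitz_const)


def certified_pixel_accuracy(pixel_logits, labels, epsilon, lipschitz_const=1.0):
    """Fraction of pixels that are correct and certified at radius `epsilon`."""
    certified = 0
    for logits, label in zip(pixel_logits, labels):
        predicted = max(range(len(logits)), key=logits.__getitem__)
        if predicted == label and certified_radius(logits, lipschitz_const) > epsilon:
            certified += 1
    return certified / len(labels)


if __name__ == "__main__":
    logits = [[3.0, 1.0], [1.2, 1.0], [0.0, 2.0]]  # three pixels, two classes
    print(certified_pixel_accuracy(logits, [0, 0, 0], epsilon=0.5))
```

The certificate costs one forward pass plus a sort per pixel, which is consistent with the abstract's emphasis on certification speed relative to sampling-based methods such as randomized smoothing.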

Abstract:
Self‑supervised contrastive learning is among the recent representation learning methods that have shown performance gains in several downstream tasks including semantic segmentation. This paper evaluates strong data augmentation, one of the most important components for self‑supervised contrastive learning's improved performance. Strong data augmentation involves applying the composition of multiple augmentation techniques on images. Surprisingly, we find that the existing data augmentations do not always improve performance for semantic segmentation for medical images. We experiment with other augmentations that provide improved performance.

Abstract:
Weakly supervised semantic segmentation (WSSS) in histopathology reduces pixel‑level labeling by learning from image‑level labels, but it is hindered by inter‑class homogeneity, intra‑class heterogeneity, and CAM‑induced region shrinkage (global pooling‑based class activation maps whose activations highlight only the most distinctive areas and miss nearby class regions). Recent works address these challenges by constructing a clustering prototype bank and then refining masks in a separate stage; however, such two‑stage pipelines are costly, sensitive to hyperparameters, and decouple prototype discovery from segmentation learning, limiting their effectiveness and efficiency. We propose a cluster‑free, one‑stage learnable‑prototype framework with diversity regularization to enhance morphological intra‑class heterogeneity coverage. Our approach achieves state‑of‑the‑art (SOTA) performance on BCSS‑WSSS, outperforming prior methods in mIoU and mDice. Qualitative segmentation maps show sharper boundaries and fewer mislabels, and activation heatmaps further reveal that, compared with clustering‑based prototypes, our learnable prototypes cover more diverse and complementary regions within each class, providing consistent qualitative evidence for their effectiveness.

Abstract:
Semantic segmentation of 3D point cloud data often comes with high annotation costs. Active learning automates the process of selecting which data to annotate, reducing the total amount of annotation needed to achieve satisfactory performance. Recent approaches to active learning for 3D point clouds are often based on sophisticated heuristics for both splitting point clouds into annotatable regions and selecting the most beneficial ones for further neural network training. In this work, we propose a novel and easy‑to‑implement strategy to separate the point cloud into annotatable regions. In our approach, we utilize a 2D grid to subdivide the point cloud into columns. To identify the next data to be annotated, we employ a network ensemble to estimate the uncertainty in the network output. We evaluate our method on the S3DIS dataset, the Toronto‑3D dataset, and a large‑scale urban 3D point cloud of the city of Freiburg, which we labeled in parts manually. The extensive evaluation shows that our method yields performance on par with, or even better than, complex state‑of‑the‑art methods on all datasets. Furthermore, we provide results suggesting that in the context of point clouds the annotated area can be a more meaningful measure for active learning algorithms than the number of annotated points.
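
The grid-based splitting and ensemble uncertainty described above can be sketched in a few lines. The function names, the `cell_size` parameter, and the choice of predictive entropy of the averaged ensemble output as the uncertainty measure are assumptions for illustration; the abstract does not specify which uncertainty score is used.

```python
import math
from collections import defaultdict


def split_into_columns(points, cell_size):
    """Group point indices into vertical columns using a 2D grid over x and y.

    The z coordinate is ignored, so each grid cell collects every point
    above and below it.
    """
    columns = defaultdict(list)
    for index, (x, y, _z) in enumerate(points):
        key = (math.floor(x / cell_size), math.floor(y / cell_size))
        columns[key].append(index)
    return dict(columns)


def ensemble_entropy(member_probs):
    """Entropy of the class distribution averaged over ensemble members."""
    num_members = len(member_probs)
    num_classes = len(member_probs[0])
    mean = [sum(p[c] for p in member_probs) / num_members for c in range(num_classes)]
    return -sum(p * math.log(p) for p in mean if p > 0)


if __name__ == "__main__":
    cloud = [(0.2, 0.3, 1.0), (0.9, 0.1, 5.0), (1.5, 0.2, 0.0), (-0.1, 0.0, 2.0)]
    print(split_into_columns(cloud, cell_size=1.0))
    print(ensemble_entropy([[1.0, 0.0], [0.0, 1.0]]))
```

A selection step could then average the per-point uncertainty within each column and send the highest-scoring columns to the annotator.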

Abstract:
The increasing availability of Earth observation data offers unprecedented opportunities for large‑scale environmental monitoring and analysis. However, these datasets are inherently heterogeneous, stemming from diverse sensors, geographical regions, acquisition times, and atmospheric conditions. Distribution shifts between training and deployment domains severely limit the generalization of pretrained remote sensing models, making unsupervised domain adaptation (UDA) crucial for real‑world applications. We introduce FlowEO, a novel framework that leverages generative models for image‑space UDA in Earth observation. We leverage flow matching to learn a semantically preserving mapping that transports from the source to the target image distribution. This allows us to tackle challenging domain adaptation configurations for classification and semantic segmentation of Earth observation images. We conduct extensive experiments across four datasets covering adaptation scenarios such as SAR to optical translation and temporal and semantic shifts caused by natural disasters. Experimental results demonstrate that FlowEO outperforms existing image translation approaches for domain adaptation while achieving on‑par or better perceptual image quality, highlighting the potential of flow‑matching‑based UDA for remote sensing.

Abstract:
Standard Vision Transformers flatten 2D images into 1D sequences, disrupting the natural spatial topology. While Rotary Positional Embedding (RoPE) excels in 1D, it inherits this limitation, often treating spatially distant patches (e.g., at row edges) as sequence neighbors. Existing 2D approaches typically treat spatial axes independently, failing to decouple this false sequential proximity from true spatial distance. To restore the 2D spatial manifold, we introduce Geometric Positional Embedding (GeoPE), a framework that extends rotations to 3D Euclidean space using quaternions. To overcome non‑commutativity and ensure symmetry, GeoPE constructs a unified rotational operator by computing the geometric mean in the Lie algebra. This creates a geometrically coupled encoding that effectively separates spatial dimensions. Extensive experiments on image classification, object detection, and 3D semantic segmentation demonstrate that GeoPE consistently outperforms existing 2D RoPE variants and significantly enhances shape bias, confirming its ability to capture true geometric structure.
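
The false sequential proximity mentioned above is easy to reproduce numerically. The sketch below flattens a patch grid in row-major order and compares the 1D index distance with the true 2D distance; the 14-patch grid width and the helper names are illustrative assumptions.

```python
import math


def flat_index(row, col, width):
    """Row-major position of a patch in the flattened 1D sequence."""
    return row * width + col


def sequence_distance(a, b, width):
    """Distance between two patches as seen by a 1D positional encoding."""
    return abs(flat_index(a[0], a[1], width) - flat_index(b[0], b[1], width))


def spatial_distance(a, b):
    """Euclidean distance between two patches on the 2D grid."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


if __name__ == "__main__":
    width = 14  # patches per row, a hypothetical grid size
    end_of_row, start_of_next = (0, width - 1), (1, 0)
    print(sequence_distance(end_of_row, start_of_next, width),
          spatial_distance(end_of_row, start_of_next))
    above, below = (0, 0), (1, 0)
    print(sequence_distance(above, below, width), spatial_distance(above, below))
```

The first pair is adjacent in the sequence yet far apart in the image, while the second pair is vertically adjacent in the image yet a full row apart in the sequence.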

Abstract:
Depth completion plays a vital role in 3D perception systems, especially in scenarios where sparse depth data must be densified for tasks such as autonomous driving, robotics, and augmented reality. While many existing approaches rely on semantic segmentation to guide depth completion, they often overlook the benefits of object‑level understanding. In this work, we introduce an instance‑aware depth completion framework that explicitly integrates binary instance masks as spatial priors to refine depth predictions. Our model combines four main components: a frozen YOLO V11 instance segmentation branch, a U‑Net‑based depth completion backbone, a cross‑attention fusion module, and an attention‑guided prediction head. The instance segmentation branch generates per‑image foreground masks that guide the depth branch via cross‑attention, allowing the network to focus on object‑centric regions during refinement. We validate our method on the Virtual KITTI 2 dataset, showing that it achieves lower Root Mean Squared Error (RMSE) compared to both a U‑Net‑only baseline and previous semantic‑guided methods, while maintaining competitive Mean Absolute Error (MAE). Qualitative and quantitative results demonstrate that the proposed model effectively enhances depth accuracy near object boundaries, occlusions, and thin structures. Our findings suggest that incorporating instance‑aware cues offers a promising direction for improving depth completion without relying on dense semantic labels.

Abstract:
Infrared imaging plays a critical role in low‑light and adverse weather conditions. However, due to the distinct characteristics of infrared images, existing foundation models such as Masked Autoencoder (MAE) trained on visible data perform suboptimally in infrared image interpretation tasks. To bridge this gap, an infrared foundation model known as InfMAE was developed and pre‑trained on large‑scale infrared datasets. Despite its effectiveness, InfMAE still faces several limitations, including the omission of informative tokens, insufficient modeling of global associations, and neglect of non‑uniform noise. In this paper, we propose a Dual‑domain Guided Infrared foundation model based on MAE (DuGI‑MAE). First, we design a deterministic masking strategy based on token entropy, preserving only high‑entropy tokens for reconstruction to enhance informativeness. Next, we introduce a Dual‑Domain Guidance (DDG) module, which simultaneously captures global token relationships and adaptively filters non‑uniform background noise commonly present in infrared imagery. To facilitate large‑scale pretraining, we construct Inf‑590K, a comprehensive infrared image dataset encompassing diverse scenes, various target types, and multiple spatial resolutions. Pretrained on Inf‑590K, DuGI‑MAE demonstrates strong generalization capabilities across various downstream tasks, including infrared object detection, semantic segmentation, and small target detection. Experimental results validate the superiority of the proposed method over both supervised and self‑supervised comparison methods. Our code is available in the supplementary material.
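
As an illustration of ranking tokens by entropy, the sketch below computes the Shannon entropy of each token's intensity histogram and returns the highest-entropy tokens. The histogram-based definition of token entropy, the bin count, and the names `token_entropy` and `high_entropy_indices` are assumptions; the abstract does not define how token entropy is measured or exactly how the selected tokens enter the reconstruction task.

```python
import math
from collections import Counter


def token_entropy(values, bins=8, max_value=256):
    """Shannon entropy (in bits) of a token's quantised intensity histogram."""
    counts = Counter(min(int(v * bins / max_value), bins - 1) for v in values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def high_entropy_indices(tokens, keep_ratio=0.5, bins=8):
    """Deterministically pick the tokens with the highest entropy."""
    keep = max(1, int(keep_ratio * len(tokens)))
    ranked = sorted(range(len(tokens)),
                    key=lambda i: token_entropy(tokens[i], bins=bins),
                    reverse=True)
    return sorted(ranked[:keep])


if __name__ == "__main__":
    tokens = [
        [0, 0, 0, 0],        # flat background
        [0, 64, 128, 192],   # highly varied patch
        [0, 0, 128, 128],    # two intensity levels
        [10, 10, 10, 10],    # flat background
    ]
    print(high_entropy_indices(tokens, keep_ratio=0.5, bins=4))
```

Unlike random masking, the selection is a pure function of the image, so the same input always yields the same token subset.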

Abstract:
Transformer‑based architectures have become a dominant paradigm in vision and language, but their success is often attributed to large model capacity and massive training data. In this work, we examine how self‑supervised pre‑training, intermediate fine‑tuning, and downstream fine‑tuning interact in a low‑capacity regime, using a 5M‑parameter Vision Transformer for semantic segmentation. Across multiple data scales, we find that masked image modeling pre‑training and downstream fine‑tuning reliably improve performance, but with clear diminishing returns as supervision increases. In contrast, inserting an intermediate classification fine‑tuning stage consistently degrades downstream performance, with the largest drops occurring precisely where pre‑training is most effective. Through an analysis of patch‑level representation geometry, we show that classification‑based intermediate supervision actively interferes with representations learned during pre‑training by collapsing spatial structure critical for dense prediction. These results indicate that, in small models, the geometry of supervision matters more than the number of training stages: misaligned intermediate objectives can negate the benefits of pre‑training rather than amplify them.

Abstract:
This paper presents an autonomous tomato‑harvesting system built around a hybrid robotic gripper that combines six soft auxetic fingers with a rigid exoskeleton and a latex basket to achieve gentle, cage‑like grasping. The gripper is driven by a servo‑actuated Scotch‑yoke mechanism, and includes separator leaves that form a conical frustum for fruit isolation, with an integrated micro‑servo cutter for pedicel cutting. For perception, an RGB‑D camera and a Detectron2‑based pipeline perform semantic segmentation of ripe/unripe tomatoes and keypoint localization of the pedicel and fruit center under occlusion and variable illumination. An analytical model derived using the principle of virtual work relates servo torque to grasp force, enabling design‑level reasoning about actuation requirements. During execution, closed‑loop grasp‑force regulation is achieved using a proportional‑integral‑derivative controller with feedback from force‑sensitive resistors mounted on selected fingers to prevent slip and bruising. Motion execution is supported by Particle Swarm Optimization (PSO)‑based trajectory planning for a 5‑DOF manipulator. Experiments demonstrate complete picking cycles (approach, separation, cutting, grasping, transport, release) with an average cycle time of 24.34 s and an overall success rate of approximately 80%, while maintaining low grasp forces (0.20 to 0.50 N). These results validate the proposed hybrid gripper and integrated vision‑control pipeline for reliable harvesting in cluttered environments.
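
The closed-loop grasp-force regulation described above can be pictured with a textbook discrete PID loop. The controller below is generic; the gains, the time step, the 0.35 N target, and especially the first-order plant used to stand in for the gripper and force sensor are hypothetical values chosen only so that the example runs, not parameters from the paper.

```python
class PID:
    """Textbook discrete PID controller with a fixed time step."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = None  # skip the derivative term on the first step

    def step(self, setpoint, measurement):
        error = setpoint - measurement
        self.integral += error * self.dt
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate_grasp(target_force, steps=500, dt=0.01, tau=0.1):
    """Drive a hypothetical first-order force response towards `target_force`."""
    pid = PID(kp=2.0, ki=5.0, kd=0.001, dt=dt)
    force = 0.0
    for _ in range(steps):
        command = pid.step(target_force, force)
        force += dt * (command - force) / tau  # toy plant, not the real gripper
    return force


if __name__ == "__main__":
    print(round(simulate_grasp(0.35), 3))  # target inside the reported force range
```

The integral term is what removes the steady-state error here; with proportional action alone this toy plant would settle below the target force.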

Abstract:
Generalizing open‑vocabulary 3D instance segmentation (OV‑3DIS) to diverse, unstructured, and mesh‑free environments is crucial for robotics and AR/VR, yet remains a significant challenge. We attribute this to two key limitations of existing methods: (1) proposal generation relies on dataset‑specific proposal networks or mesh‑based superpoints, rendering them inapplicable in mesh‑free scenarios and limiting generalization to novel scenes; and (2) the weak textual reasoning of CLIP‑based classifiers, which struggle to recognize compositional and functional user queries. To address these issues, we introduce OpenTrack3D, a generalizable and accurate framework. Unlike methods that rely on pre‑generated proposals, OpenTrack3D employs a novel visual‑spatial tracker to construct cross‑view consistent object proposals online. Given an RGB‑D stream, our pipeline first leverages a 2D open‑vocabulary segmenter to generate masks, which are lifted to 3D point clouds using depth. Mask‑guided instance features are then extracted using DINO feature maps, and our tracker fuses visual and spatial cues to maintain instance consistency. The core pipeline is entirely mesh‑free, yet we also provide an optional superpoints refinement module to further enhance performance when scene mesh is available. Finally, we replace CLIP with a multi‑modal large language model (MLLM), significantly enhancing compositional reasoning for complex user queries. Extensive experiments on diverse benchmarks, including ScanNet200, Replica, ScanNet++, and SceneFun3D, demonstrate state‑of‑the‑art performance and strong generalization capabilities.

Abstract:
When applied sequentially to video, frame‑based networks often exhibit temporal inconsistency ‑ for example, outputs that flicker between frames. This problem is amplified when the network inputs contain time‑varying corruptions. In this work, we introduce a general approach for adapting frame‑based models for stable and robust inference on video. We describe a class of stability adapters that can be inserted into virtually any architecture and a resource‑efficient training process that can be performed with a frozen base network. We introduce a unified conceptual framework for describing temporal stability and corruption robustness, centered on a proposed accuracy‑stability‑robustness loss. By analyzing the theoretical properties of this loss, we identify the conditions where it produces well‑behaved stabilizer training. Our experiments validate our approach on several vision tasks including denoising (NAFNet), image enhancement (HDRNet), monocular depth (Depth Anything v2), and semantic segmentation (DeepLabv3+). Our method improves temporal stability and robustness against a range of image corruptions (including compression artifacts, noise, and adverse weather), while preserving or improving the quality of predictions.

Abstract:
Recent advances in 4D radar highlight its potential for robust environment perception under adverse conditions, yet progress in radar semantic segmentation remains constrained by the scarcity of open‑source datasets and labels. The RaDelft dataset, although seminal, provides only LiDAR annotations and no public code to generate radar labels, limiting reproducibility and downstream research. In this work, we reproduce the numerical results of the RaDelft group and demonstrate that a camera‑guided radar labeling pipeline can generate accurate labels for radar point clouds without relying on human annotations. By projecting radar point clouds into camera‑based semantic segmentation and applying spatial clustering, we create labels that significantly enhance the accuracy of radar labels. These results establish a reproducible framework that allows the research community to train and evaluate on labeled 4D radar data. In addition, we study and quantify how different fog levels affect the radar labeling performance.
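
The projection step of such a camera-guided labeling pipeline can be sketched with a pinhole camera model. The example assumes the radar points have already been transformed into the camera frame and that the segmentation map is a list of rows of class ids; the function names, the intrinsics, and the `unlabeled` marker are hypothetical, and the spatial clustering stage mentioned above is not shown.

```python
def project_point(point, fx, fy, cx, cy):
    """Project a 3D point in the camera frame to pixel coordinates (u, v)."""
    x, y, z = point
    if z <= 0:  # behind the camera, cannot be seen
        return None
    return fx * x / z + cx, fy * y / z + cy


def label_radar_points(points_cam, seg_map, fx, fy, cx, cy, unlabeled=-1):
    """Assign each radar point the class of the pixel it projects onto."""
    height, width = len(seg_map), len(seg_map[0])
    labels = []
    for point in points_cam:
        pixel = project_point(point, fx, fy, cx, cy)
        if pixel is None:
            labels.append(unlabeled)
            continue
        u, v = int(round(pixel[0])), int(round(pixel[1]))
        inside = 0 <= u < width and 0 <= v < height
        labels.append(seg_map[v][u] if inside else unlabeled)
    return labels


if __name__ == "__main__":
    seg = [[0, 0, 1, 1] for _ in range(4)]  # left half class 0, right half class 1
    pts = [(0.5, 0.0, 1.0), (-1.0, 0.0, 1.0), (0.0, 0.0, -1.0), (5.0, 0.0, 1.0)]
    print(label_radar_points(pts, seg, fx=2.0, fy=2.0, cx=2.0, cy=2.0))
```

Points that fall outside the image or behind the camera stay unlabeled, which is one reason a later clustering step can help propagate labels to nearby radar returns.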

Abstract:
Domain generalization for semantic segmentation aims to mitigate the degradation in model performance caused by domain shifts. However, in many real‑world scenarios, we are unable to access the model parameters and architectural details due to privacy concerns and security constraints. Traditional fine‑tuning or adaptation is hindered, leading to the demand for input‑level strategies that can enhance generalization without modifying model weights. To this end, we propose a Style‑Adaptive GEneralization framework (SAGE), which improves the generalization of frozen models under privacy constraints. SAGE learns to synthesize visual prompts that implicitly align feature distributions across styles instead of directly fine‑tuning the backbone. Specifically, we first utilize style transfer to construct a diverse style representation of the source domain, thereby learning a set of style characteristics that can cover a wide range of visual features. Then, the model adaptively fuses these style cues according to the visual context of each input, forming a dynamic prompt that harmonizes the image appearance without touching the interior of the model. Through this closed‑loop design, SAGE effectively bridges the gap between frozen model invariance and the diversity of unseen domains. Extensive experiments on five benchmark datasets demonstrate that SAGE achieves competitive or superior performance compared to state‑of‑the‑art methods under privacy constraints and outperforms full fine‑tuning baselines in all settings.

Abstract:
Multifractal analysis has revealed regularities in many self‑seeding phenomena, yet its use in modern deep learning remains limited. Existing end‑to‑end multifractal methods rely on heavy pooling or strong feature‑space decimation, which constrain tasks such as semantic segmentation. Motivated by these limitations, we introduce two inductive priors: Monofractal and Multifractal Recalibration. These methods leverage relationships between the probability mass of the exponents and the multifractal spectrum to form statistical descriptions of encoder embeddings, implemented as channel‑attention functions in convolutional networks. Using a U‑Net‑based framework, we show that multifractal recalibration yields substantial gains over a baseline equipped with other channel‑attention mechanisms that also use higher‑order statistics. Given the proven ability of multifractal analysis to capture pathological regularities, we validate our approach on three public medical‑imaging datasets: ISIC18 (dermoscopy), Kvasir‑SEG (endoscopy), and BUSI (ultrasound). Our empirical analysis also provides insights into the behavior of these attention layers. We find that excitation responses do not become increasingly specialized with encoder depth in U‑Net architectures due to skip connections, and that their effectiveness may relate to global statistics of instance variability.

Abstract:
The Segment Anything Model 2 (SAM2) has proven to be a powerful foundation model for promptable visual object segmentation in both images and videos, capable of storing object‑aware memories and transferring them temporally through memory blocks. While SAM2 excels in video object segmentation by providing dense segmentation masks based on prompts, extending it to dense Video Semantic Segmentation (VSS) poses challenges due to the need for spatial accuracy, temporal consistency, and the ability to track multiple objects with complex boundaries and varying scales. This paper explores the extension of SAM2 for VSS, focusing on two primary approaches and highlighting firsthand observations and common challenges faced during this process. The first approach involves using SAM2 to extract unique objects as masks from a given image, with a segmentation network employed in parallel to generate and refine initial predictions. The second approach utilizes the predicted masks to extract unique feature vectors, which are then fed into a simple network for classification. The resulting classifications and masks are subsequently combined to produce the final segmentation. Our experiments suggest that leveraging SAM2 enhances overall performance in VSS, primarily due to its precise predictions of object boundaries.

Abstract:
In recent years, Contrastive Language‑Image Pretraining (CLIP) has been widely applied to Weakly Supervised Semantic Segmentation (WSSS) tasks due to its powerful cross‑modal semantic understanding capabilities. This paper proposes a novel Semantic and Spatial Rectification (SSR) method to address the limitations of existing CLIP‑based weakly supervised semantic segmentation approaches: over‑activation in non‑target foreground regions and background areas. Specifically, at the semantic level, the Cross‑Modal Prototype Alignment (CMPA) establishes a contrastive learning mechanism to enforce feature space alignment across modalities, reducing inter‑class overlap while enhancing semantic correlations, to rectify over‑activation in non‑target foreground regions effectively; at the spatial level, the Superpixel‑Guided Correction (SGC) leverages superpixel‑based spatial priors to precisely filter out interference from non‑target regions during affinity propagation, significantly rectifying background over‑activation. Extensive experiments on the PASCAL VOC and MS COCO datasets demonstrate that our method outperforms all single‑stage approaches, as well as more complex multi‑stage approaches, achieving mIoU scores of 79.5% and 50.6%, respectively.

Abstract:
Test‑Time Training (TTT) has recently emerged as a promising direction for efficient sequence modeling. TTT reformulates the attention operation as an online learning problem, constructing a compact inner model from key‑value pairs at test time. This reformulation opens a rich and flexible design space while achieving linear computational complexity. However, crafting a powerful visual TTT design remains challenging: fundamental choices for the inner module and inner training lack comprehensive understanding and practical guidelines. To bridge this critical gap, we present a systematic empirical study of TTT designs for visual sequence modeling. From a series of experiments and analyses, we distill six practical insights that establish design principles for effective visual TTT and illuminate paths for future improvement. These findings culminate in the Vision Test‑Time Training (ViT^3) model, a pure TTT architecture that achieves linear complexity and parallelizable computation. We evaluate ViT^3 across diverse visual tasks, including image classification, image generation, object detection, and semantic segmentation. Results show that ViT^3 consistently matches or outperforms advanced linear‑complexity models (e.g., Mamba and linear attention variants) and effectively narrows the gap to highly optimized vision Transformers. We hope this study and the ViT^3 baseline can facilitate future work on visual TTT models. Code: github.com/LeapLabTHU/ViTTT.

Abstract:
Liquid argon time projection chambers (LArTPCs) provide dense, high‑fidelity 3D measurements of particle interactions and underpin current and future neutrino and rare‑event experiments. Physics reconstruction typically relies on complex detector‑specific pipelines that use tens of hand‑engineered pattern recognition algorithms or cascades of task‑specific neural networks. These networks require extensive labeled simulation, which in turn demands a careful, time‑consuming calibration process. We introduce Panda, a model that learns reusable sensor‑level representations directly from raw unlabeled LArTPC data. Panda couples a hierarchical sparse 3D encoder with a multi‑view, prototype‑based self‑distillation objective. On a simulated dataset, Panda substantially improves label efficiency and reconstruction quality, beating the previous state‑of‑the‑art semantic segmentation model with 1,000× fewer labels. We also show that a single set‑prediction head 1/20th the size of the backbone, with no physical priors and trained on frozen outputs from Panda, can yield particle identification that is comparable with state‑of‑the‑art (SOTA) reconstruction tools. Full fine‑tuning further improves performance across all tasks.

Abstract:
3D Gaussian Splatting (3D‑GS) has emerged as an efficient 3D representation and a promising foundation for semantic tasks like segmentation. However, existing 3D‑GS‑based segmentation methods typically rely on high‑dimensional category features, which introduce substantial memory overhead. Moreover, fine‑grained segmentation remains challenging due to label space congestion and the lack of stable multi‑granularity control mechanisms. To address these limitations, we propose a coarse‑to‑fine binary encoding scheme for per‑Gaussian category representation, which compresses each feature into a single integer via the binary‑to‑decimal mapping, drastically reducing memory usage. We further design a progressive training strategy that decomposes panoptic segmentation into a series of independent sub‑tasks, reducing inter‑class conflicts and thereby enhancing fine‑grained segmentation capability. Additionally, we fine‑tune opacity during segmentation training to address the incompatibility between photometric rendering and semantic segmentation, which often leads to foreground‑background confusion. Extensive experiments on multiple benchmarks demonstrate that our method achieves state‑of‑the‑art segmentation performance while significantly reducing memory consumption and accelerating inference.
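
As a rough illustration of the binary‑to‑decimal mapping mentioned above, the sketch below packs a coarse‑to‑fine sequence of binary decisions into one integer and unpacks it again. The function names and the most‑significant‑bit‑first ordering are assumptions made for illustration; the abstract does not specify these details.

```python
def encode_bits(bits):
    """Pack a coarse-to-fine list of 0/1 decisions into one integer.

    Assumption: the first (coarsest) decision becomes the most significant bit.
    """
    code = 0
    for b in bits:
        if b not in (0, 1):
            raise ValueError("bits must be 0 or 1")
        code = (code << 1) | b
    return code


def decode_bits(code, depth):
    """Recover the list of `depth` binary decisions from the integer code."""
    return [(code >> shift) & 1 for shift in range(depth - 1, -1, -1)]


def coarse_prefix(code, depth, level):
    """Integer code of only the first `level` (coarsest) decisions."""
    return code >> (depth - level)
```

With this ordering, two codes that share a coarse prefix belong to the same coarse group, so a coarser granularity can be read off by shifting the stored integer rather than storing a separate feature.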

Abstract:
Intelligent driving systems are vulnerable to physical adversarial attacks on traffic signs. These attacks can cause misclassification, leading to erroneous driving decisions that compromise road safety. Moreover, within V2X networks, such misinterpretations can propagate, inducing cascading failures that disrupt overall traffic flow and system stability. However, a key limitation of current physical attacks is their lack of stealth. Most methods apply perturbations to central regions of the sign, resulting in visually salient patterns that are easily detectable by human observers, thereby limiting their real‑world practicality. This study proposes TESP‑Attack, a novel stealth‑aware adversarial patch method for traffic sign classification. Based on the observation that human visual attention primarily focuses on the central regions of traffic signs, we employ instance segmentation to generate edge‑aligned masks that conform to the shape characteristics of the signs. A U‑Net generator is utilized to craft adversarial patches, which are then optimized through color and texture constraints along with frequency domain analysis to achieve seamless integration with the background environment, resulting in highly effective visual concealment. The proposed method demonstrates outstanding attack success rates across traffic sign classification models with varied architectures, achieving over 90% under limited query budgets. It also exhibits strong cross‑model transferability and maintains robust real‑world performance that remains stable under varying angles and distances.

Abstract:
The increasing prevalence of thyroid cancer globally has led to the development of various computer‑aided detection methods. Accurate segmentation of thyroid nodules is a critical first step in the development of AI‑assisted clinical decision support systems. This study focuses on instance segmentation of thyroid nodules using YOLOv5 algorithms on ultrasound images. We evaluated multiple YOLOv5 variants (Nano, Small, Medium, Large, and XLarge) across two dataset versions, with and without Doppler images. The YOLOv5‑Large algorithm achieved the highest performance, with a Dice score of 91% and mAP of 0.87 on the dataset including Doppler images. Notably, our results demonstrate that Doppler images, typically excluded by physicians, can significantly improve segmentation performance. The YOLOv5‑Small model achieved a 79% Dice score when Doppler images were excluded, while including them improved performance across all model variants. These findings suggest that instance segmentation with YOLOv5 provides an effective real‑time approach for thyroid nodule detection, with potential clinical applications in automated diagnostic systems.

Abstract:
Superpoint‑based pipelines provide an efficient alternative to point‑ or voxel‑based 3D semantic segmentation, but are often bottlenecked by their CPU‑bound partition step. We propose a learnable, fully GPU‑based partitioning algorithm that generates geometrically and semantically coherent superpoints 13× faster than prior methods. Our module is compact (under 60k parameters), trains in under 20 minutes with a differentiable surrogate loss, and requires no handcrafted features. Combined with a lightweight superpoint classifier, the full pipeline fits in <2 MB of VRAM, scales to multi‑million‑point scenes, and supports real‑time inference. With 72× faster inference and 120× fewer parameters, EZ‑SP matches the accuracy of point‑based SOTA models across three domains: indoor scans (S3DIS), autonomous driving (KITTI‑360), and aerial LiDAR (DALES). Code and pretrained models are accessible at github.com/drprojects/superpoint_transformer.

Abstract:
Document chunking is a crucial component of Retrieval‑Augmented Generation (RAG), as it directly affects the retrieval of relevant and precise context. Conventional fixed‑length and recursive splitters often produce arbitrary, incoherent segments that fail to preserve semantic structure. Although semantic chunking has gained traction, its influence on generation quality remains underexplored. This paper introduces two efficient semantic chunking methods, Projected Similarity Chunking (PSC) and Metric Fusion Chunking (MFC), trained on PubMed data using three different embedding models. We further present an evaluation framework that measures the effect of chunking on both retrieval and generation by augmenting PubMedQA with full‑text PubMed Central articles. Our results show substantial retrieval improvements (24x with PSC) in MRR and higher Hits@k on PubMedQA. We provide a comprehensive analysis, including statistical significance and response‑time comparisons with common chunking libraries. Despite being trained on a single domain, PSC and MFC also generalize well, achieving strong out‑of‑domain generation performance across multiple datasets. Overall, our findings confirm that our semantic chunkers, especially PSC, consistently deliver superior performance.
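
The retrieval metrics named above, MRR and Hits@k, can be sketched as follows. This is a minimal illustration of the standard definitions, not the paper's evaluation code; the function names and the representation of results as ranked lists of chunk identifiers are assumptions.

```python
def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """Average of 1/rank of the first relevant item per query (0 if none is retrieved)."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, item in enumerate(ranked, start=1):
            if item in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)


def hits_at_k(ranked_lists, relevant_sets, k):
    """Fraction of queries with at least one relevant item among the top k results."""
    hits = sum(
        1
        for ranked, relevant in zip(ranked_lists, relevant_sets)
        if any(item in relevant for item in ranked[:k])
    )
    return hits / len(ranked_lists)
```

Both metrics depend on how chunks are cut: a chunker that splits the relevant passage across several segments can push the first relevant hit further down the ranking, which lowers MRR even when Hits@k is unchanged.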

Abstract:
Human physical reasoning relies on internal "body" representations: coarse, volumetric approximations that capture an object's extent and support intuitive predictions about motion and physics. While psychophysical evidence suggests humans use such coarse representations, their internal structure remains largely unknown. Here we test whether vision models trained for segmentation develop comparable representations. We adapt a psychophysical experiment conducted with 50 human participants to a semantic segmentation task and test a family of seven segmentation networks, varying in size. We find that smaller models naturally form human‑like coarse body representations, whereas larger models tend toward overly detailed, fine‑grained encodings. Our results demonstrate that coarse representations can emerge under limited computational resources, and that machine representations can provide a scalable path toward understanding the structure of physical reasoning in the brain.

Abstract:
Accurate assessment of post‑disaster damage is essential for prioritizing emergency response, yet current practices rely heavily on manual interpretation of satellite imagery. This approach is time‑consuming, subjective, and difficult to scale during large‑area disasters. Although recent deep‑learning models for semantic segmentation and change detection have improved automation, many of them still struggle to capture subtle structural variations and often perform poorly when dealing with highly imbalanced datasets, where undamaged buildings dominate. This thesis introduces Satellite‑to‑Street: Disaster Impact Estimator, a deep‑learning framework that produces detailed, pixel‑level damage maps by analyzing pre‑ and post‑disaster satellite images together. The model is built on a modified dual‑input U‑Net architecture that strengthens feature fusion between both images, allowing it to detect not only small, localized changes but also broader contextual patterns across the scene. To address the imbalance between damage categories, a class‑aware weighted loss function is used, which helps the model better recognize major and destroyed structures. A consistent preprocessing pipeline is employed to align image pairs, standardize resolutions, and prepare the dataset for training. Experiments conducted on publicly available disaster datasets show that the proposed framework achieves better classification of damaged regions compared to conventional segmentation networks. The generated damage maps provide a faster and more objective method for analyzing disaster impact, working alongside expert judgment rather than replacing it. In addition to identifying which areas are damaged, the system is capable of distinguishing different levels of severity, ranging from slight impact to complete destruction. This provides a more detailed and practical understanding of how the disaster has affected each region.
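
One common way to realize a class‑aware weighted loss is inverse‑frequency weighting of a cross‑entropy term, sketched below. The abstract does not state which weighting scheme the thesis uses, so the formula `total / (num_classes * count)` and both function names are assumptions chosen for illustration.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels, num_classes):
    """Weight each class by total / (num_classes * count); absent classes get 0."""
    counts = Counter(labels)
    total = len(labels)
    return [
        total / (num_classes * counts[c]) if counts[c] else 0.0
        for c in range(num_classes)
    ]


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class) over pixels.

    probs: per-pixel lists of class probabilities; labels: per-pixel class ids.
    """
    numerator = sum(-weights[y] * math.log(p[y]) for p, y in zip(probs, labels))
    denominator = sum(weights[y] for y in labels)
    return numerator / denominator
```

Under this scheme a rare class such as "destroyed" contributes as much to the loss in aggregate as the dominant "undamaged" class, which is the effect the class‑aware loss is meant to achieve.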

Abstract:
Forests play a critical role in global ecosystems by supporting biodiversity and mitigating climate change via carbon sequestration. Accurate aboveground biomass (AGB) estimation is essential for assessing carbon storage and wildfire fuel loads, yet traditional methods rely on labor‑intensive field measurements or remote sensing approaches with significant limitations in dense vegetation. In this work, we propose a novel learning‑based method for estimating AGB from a single ground‑based RGB image. We frame this as a dense prediction task, introducing AGB density maps, where each pixel represents tree biomass normalized by the plot area and each tree's image area. We leverage the recently introduced synthetic 3D SPREAD dataset, which provides realistic forest scenes with per‑image tree attributes (height, trunk and canopy diameter) and instance segmentation masks. Using these assets, we compute AGB via allometric equations and train a model to predict AGB density maps, integrating them to recover the AGB estimate for the captured scene. Our approach achieves a median AGB estimation error of 1.22 kg/m^2 on held‑out SPREAD data and 1.94 kg/m^2 on a real‑image dataset. To our knowledge, this is the first method to estimate aboveground biomass directly from a single RGB image, opening up the possibility for a scalable, interpretable, and cost‑effective solution for forest monitoring, while also enabling broader participation through citizen science initiatives.

Abstract:
Extreme exposure degrades both the 3D map reconstruction and semantic segmentation accuracy, which is particularly detrimental to tightly‑coupled systems. To achieve illumination invariance, we propose a novel semantic SLAM framework with two designs. First, the Intrinsic Appearance Normalization (IAN) module proactively disentangles the scene's intrinsic properties, such as albedo, from transient lighting. By learning a standardized, illumination‑invariant appearance model, it assigns a stable and consistent color representation to each Gaussian primitive. Second, the Dynamic Radiance Balancing Loss (DRB‑Loss) reactively handles frames with extreme exposure. It activates only when an image's exposure is poor, operating directly on the radiance field to guide targeted optimization. This prevents error accumulation from extreme lighting without compromising performance under normal conditions. The synergy between IAN's proactive invariance and DRB‑Loss's reactive correction endows our system with unprecedented robustness. Evaluations on public datasets demonstrate state‑of‑the‑art performance in camera tracking, map quality, and semantic and geometric accuracy.

Abstract:
Eye tracking has become increasingly important in virtual and augmented reality applications; however, the current gaze accuracy falls short of meeting the requirements for spatial computing. We designed a gaze collection framework and utilized high‑precision equipment to gather the first precise benchmark dataset, GazeTrack, encompassing diverse ethnicities, ages, and visual acuity conditions for pupil localization and gaze tracking. We propose a novel shape error regularization method to constrain pupil ellipse fitting and train on open‑source datasets, enhancing semantic segmentation and pupil position prediction accuracy. Additionally, we invent a novel coordinate transformation method similar to paper unfolding to accurately predict gaze vectors on the GazeTrack dataset. Finally, we built a gaze vector generation model that achieves reduced gaze angle error with lower computational complexity compared to other methods.

Abstract:
This paper presents a novel approach for affordance‑informed robotic manipulation by introducing 3D keypoints to enhance the understanding of object parts' functionality. The proposed approach provides direct information about what the potential use of objects is, as well as guidance on where and how a manipulator should engage, whereas conventional methods treat affordance detection as a semantic segmentation task, focusing solely on answering the what question. To address this gap, we propose a Fusion‑based Affordance Keypoint Network (FAKP‑Net) by introducing a 3D keypoint quadruplet that harnesses the synergistic potential of RGB and depth images to provide information on execution position, direction, and extent. Benchmark testing demonstrates that FAKP‑Net outperforms existing models by significant margins in both the affordance segmentation and keypoint detection tasks. Real‑world experiments also showcase the reliability of our method in accomplishing manipulation tasks with previously unseen objects.

Abstract:
Convolutional neural networks have shown remarkable performance in recent years on various computer vision problems. However, the traditional convolutional neural network architecture lacks a critical property: shift equivariance and invariance, broken by downsampling and upsampling operations. Although data augmentation techniques can help the model learn the latter property empirically, a consistent and systematic way to achieve this goal is by designing downsampling and upsampling layers that theoretically guarantee these properties by construction. Adaptive Polyphase Sampling (APS) introduced the cornerstone for shift invariance, later extended to shift equivariance with Learnable Polyphase up/downsampling (LPS) applied to real‑valued neural networks. In this paper, we extend the work on LPS to complex‑valued neural networks both from a theoretical perspective and with a novel building block of a projection layer from $\mathbb{C}$ to $\mathbb{R}$ before the Gumbel Softmax. We finally evaluate this extension on several computer vision problems, specifically for either the invariance property in classification tasks or the equivariance property in both reconstruction and semantic segmentation problems, using polarimetric Synthetic Aperture Radar images.
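
The idea behind polyphase sampling can be sketched in one dimension. The example below is a simplified, real‑valued illustration under stated assumptions (stride 2, circular shifts, selection by largest squared L2 norm); it is not the learnable or complex‑valued layer studied in the paper, and `aps_downsample` is a hypothetical name.

```python
def aps_downsample(x):
    """Stride-2 downsampling that keeps the polyphase component with the larger energy.

    A plain stride-2 layer always keeps x[0::2], so a one-sample shift changes
    which samples survive. Choosing the component by its norm makes the surviving
    set of samples independent of a circular shift of the input.
    """
    even, odd = x[0::2], x[1::2]

    def energy(component):
        return sum(v * v for v in component)

    return even if energy(even) >= energy(odd) else odd


signal = [1, 5, 2, 7, 3, 9]
shifted = signal[-1:] + signal[:-1]  # circular shift by one sample
```

For `signal` and `shifted`, the selected components contain the same values up to a circular shift, whereas fixed stride‑2 sampling (`x[0::2]`) keeps entirely different samples for the two inputs.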

Abstract:
Referring video object segmentation (RVOS) is an emerging cross‑modality task that aims to generate pixel‑level maps of the target objects referred to by given textual expressions. The main concept involves learning an accurate alignment of visual elements and language expressions within a semantic space. Recent approaches address cross‑modality alignment through conditional queries, tracking the target object using a query‑response‑based mechanism built upon a transformer structure. However, they exhibit two limitations: (1) these conditional queries lack inter‑frame dependency and variation modeling, making accurate target tracking challenging amid significant frame‑to‑frame variations; and (2) they integrate textual constraints belatedly, which may cause the video features to focus on non‑referred objects. Therefore, we propose a novel RVOS architecture called ProxyFormer, which introduces a set of proxy queries to integrate visual and text semantics and facilitate the flow of semantics between them. By progressively updating and propagating proxy queries across multiple stages of the video feature encoder, ProxyFormer ensures that the video features are focused on the object of interest. This dynamic evolution also enables the establishment of inter‑frame dependencies, enhancing the accuracy and coherence of object tracking. To mitigate high computational costs, we decouple cross‑modality interactions into temporal and spatial dimensions. Additionally, we design a Joint Semantic Consistency (JSC) training strategy to align semantic consensus between the proxy queries and the combined video‑text pairs. Comprehensive experiments on four widely used RVOS benchmarks demonstrate the superiority of our ProxyFormer over state‑of‑the‑art methods.

Abstract:
Neurons are the fundamental building blocks of deep neural networks, and their interconnections allow AI to achieve unprecedented results. Motivated by the goal of understanding how neurons encode information, compositional explanations leverage logical relationships between concepts to express the spatial alignment between neuron activations and human knowledge. However, these explanations rely on human‑annotated datasets, restricting their applicability to specific domains and predefined concepts. This paper addresses this limitation by introducing a framework for the vision domain that allows users to probe neurons for arbitrary concepts and datasets. Specifically, the framework leverages masks generated by open vocabulary semantic segmentation to compute open vocabulary compositional explanations. The proposed framework consists of three steps: specifying arbitrary concepts, generating semantic segmentation masks using open vocabulary models, and deriving compositional explanations from these masks. The paper compares the proposed framework with previous methods for computing compositional explanations both in terms of quantitative metrics and human interpretability, analyzes the differences in explanations when shifting from human‑annotated data to model‑annotated data, and showcases the additional capabilities provided by the framework in terms of flexibility of the explanations with respect to the tasks and properties of interest.

Abstract:
This paper addresses the critical need for automated crack detection in the preservation of cultural heritage through semantic segmentation. We present a comparative study of U‑Net architectures, using various convolutional neural network (CNN) encoders, for pixel‑level crack identification on statues and monuments. A comparative quantitative evaluation is performed on the test set of the OmniCrack30k dataset [1] using popular segmentation metrics including Mean Intersection over Union (mIoU), Dice coefficient, and Jaccard index. This is complemented by an out‑of‑distribution qualitative evaluation on an unlabeled test set of real‑world cracked statues and monuments. Our findings provide valuable insights into the capabilities of different CNN‑based encoders for fine‑grained crack segmentation. We show that the models exhibit promising generalization capabilities to unseen cultural heritage contexts, despite never having been explicitly trained on images of statues or monuments.
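
For reference, the overlap metrics listed above can be computed for a single binary crack mask as in the sketch below. For one class, the Jaccard index and IoU are the same quantity, and Dice relates to it as Dice = 2J / (1 + J). The function name and the convention of returning 1.0 when both masks are empty are assumptions for illustration.

```python
def dice_and_iou(pred, target):
    """Dice coefficient and IoU (Jaccard index) for flat binary masks of 0/1 values."""
    intersection = sum(p & t for p, t in zip(pred, target))
    pred_sum, target_sum = sum(pred), sum(target)
    union = pred_sum + target_sum - intersection
    # Assumed convention: two empty masks count as a perfect match.
    dice = 2 * intersection / (pred_sum + target_sum) if (pred_sum + target_sum) else 1.0
    iou = intersection / union if union else 1.0
    return dice, iou
```

Mean IoU is then the average of the per‑class IoU values. For thin structures such as cracks, both metrics are sensitive to one‑pixel boundary offsets, which is worth keeping in mind when comparing encoders.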

Abstract:
In Remote Sensing (RS), Parameter‑Efficient Fine‑Tuning (PEFT) has emerged as a key approach to activate the generalizable representation ability of foundation models for downstream tasks. However, existing specialized PEFT methods often fail when applied to large‑scale Earth observation tasks, as they are unable to fully handle the multifaceted and unpredictable domain gaps (e.g., spatial, semantic, and frequency shifts) inherent in RS data. To overcome this, we propose CrossEarth‑Gate, which introduces two primary contributions. First, we establish a comprehensive RS module toolbox to address multifaceted domain gaps, comprising spatial, semantic, and frequency modules. Second, we develop a Fisher‑guided adaptive selection mechanism that operates on this toolbox. This selection is guided by Fisher Information to quantify each module's importance by measuring its contribution to the task‑specific gradient flow. It dynamically activates only the most critical modules at the appropriate layers, guiding the gradient flow to maximize adaptation effectiveness and efficiency. Comprehensive experiments validate the efficacy and generalizability of our method, where CrossEarth‑Gate achieves state‑of‑the‑art performance across 16 cross‑domain benchmarks for RS semantic segmentation. The code of the work will be released.

Abstract:
Open‑vocabulary semantic segmentation (OVSS) aims to segment and recognize objects universally. Trained on extensive high‑quality segmentation data, the Segment Anything Model (SAM) has demonstrated remarkable universal segmentation capabilities, offering valuable support for OVSS. Although previous methods have made progress in leveraging SAM for OVSS, some challenges remain: (1) SAM's tendency to over‑segment and (2) hard combinations between fixed masks and labels. This paper introduces a novel mask‑injected framework, SAM‑MI, which effectively integrates SAM with OVSS models to address these challenges. First, SAM‑MI employs a Text‑guided Sparse Point Prompter to sample sparse prompts for SAM instead of the dense grid‑like prompts used previously, thus significantly accelerating the mask generation process. The framework then introduces Shallow Mask Aggregation (SMAgg) to merge partial masks and mitigate SAM's over‑segmentation issue. Finally, Decoupled Mask Injection (DMI) incorporates SAM‑generated masks for guidance at low and high frequencies separately, rather than directly combining them with labels. Extensive experiments on multiple benchmarks validate the superiority of SAM‑MI. Notably, the proposed method achieves a 16.7% relative improvement in mIoU over Grounded‑SAM on the MESS benchmark, along with a 1.6× speedup. We hope SAM‑MI can serve as an alternative methodology to effectively equip the OVSS model with SAM.

Abstract:
Reconstructing dynamic 4D scenes is challenging, as it requires robust disentanglement of dynamic objects from the static background. While 3D foundation models like VGGT provide accurate 3D geometry, their performance drops markedly when moving objects dominate. Existing 4D approaches often rely on external priors, heavy post‑optimization, or require fine‑tuning on 4D datasets. In this paper, we propose VGGT4D, a training‑free framework that extends the 3D foundation model VGGT for robust 4D scene reconstruction. Our approach is motivated by the key finding that VGGT's global attention layers already implicitly encode rich, layer‑wise dynamic cues. To obtain masks that decouple static and dynamic elements, we mine and amplify global dynamic cues via gram similarity and aggregate them across a temporal window. To further sharpen mask boundaries, we introduce a refinement strategy driven by projection gradient. We then integrate these precise masks into VGGT's early‑stage inference, effectively mitigating motion interference in both pose estimation and geometric reconstruction. Across six datasets, our method achieves superior performance in dynamic object segmentation, camera pose estimation, and dense reconstruction. It also supports single‑pass inference on sequences longer than 500 frames.

Abstract:
Image diffusion models, though originally developed for image generation, implicitly capture rich semantic structures that enable various recognition and localization tasks beyond synthesis. In this work, we investigate how their self‑attention maps can be reinterpreted as semantic label propagation kernels, providing robust pixel‑level correspondences between relevant image regions. Extending this mechanism across frames yields a temporal propagation kernel that enables zero‑shot object tracking via segmentation in videos. We further demonstrate the effectiveness of test‑time optimization strategies (DDIM inversion, textual inversion, and adaptive head weighting) in adapting diffusion features for robust and consistent label propagation. Building on these findings, we introduce DRIFT, a framework for object tracking in videos leveraging a pretrained image diffusion model with SAM‑guided mask refinement, achieving state‑of‑the‑art zero‑shot performance on standard video object segmentation benchmarks.
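
The notion of an attention map acting as a label propagation kernel can be sketched as follows: each query pixel in the current frame receives a softmax‑weighted average of the label vectors of key pixels in a reference frame. This is a generic illustration of attention‑based propagation, not the DRIFT implementation; the function names, the dot‑product similarity, and the `temperature` parameter are assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def propagate_labels(queries, keys, key_labels, temperature=1.0):
    """Propagate per-pixel label vectors from key pixels to query pixels.

    queries, keys: lists of feature vectors; key_labels: one-hot or soft labels per key.
    """
    num_classes = len(key_labels[0])
    propagated = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / temperature for k in keys]
        attention = softmax(scores)  # one row of the propagation kernel
        propagated.append(
            [sum(w * lab[c] for w, lab in zip(attention, key_labels)) for c in range(num_classes)]
        )
    return propagated
```

Taking the argmax over each propagated vector gives a mask for the new frame; lowering `temperature` sharpens the kernel so that each pixel copies the label of its closest match.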

Abstract:
This paper employs a multimodal approach for continuous sign recognition by first using ML for detecting the start and end frames of signs in videos of American Sign Language (ASL) sentences, and then by recognizing the segmented signs. For improved robustness, we use 3D skeletal features extracted from sign language videos to take into account the convergence of sign properties and their dynamics that tend to cluster at sign boundaries. Another focus of this paper is the incorporation of information from 3D handshape for boundary detection. To detect handshapes normally expected at the beginning and end of signs, we pretrain a handshape classifier for detection of 87 linguistically defined canonical handshape categories using a dataset that we created by integrating and normalizing several existing datasets. A multimodal fusion module is then used to unify the pretrained sign video segmentation framework and handshape classification models. Finally, the estimated boundaries are used for sign recognition, where the recognition model is trained on a large database containing both citation‑form isolated signs and signs pre‑segmented (based on manual annotations) from continuous signing, as such signs often differ a bit in certain respects. We evaluate our method on the ASLLRP corpus and demonstrate significant improvements over previous work.

Abstract:
Weakly supervised semantic segmentation (WSSS) must learn dense masks from noisy, under‑specified cues. We revisit the SegFormer decoder and show that three small, synergistic changes make weak supervision markedly more effective, without altering the MiT backbone or relying on heavy post‑processing. Our method, CrispFormer, augments the decoder with: (1) a boundary branch that supervises thin object contours using a lightweight edge head and a boundary‑aware loss; (2) an uncertainty‑guided refiner that predicts per‑pixel aleatoric uncertainty and uses it to weight losses and gate a residual correction of the segmentation logits; and (3) a dynamic multi‑scale fusion layer that replaces static concatenation with spatial softmax gating over multi‑resolution features, optionally modulated by uncertainty. The result is a single‑pass model that preserves crisp boundaries, selects appropriate scales per location, and resists label noise from weak cues. Integrated into a standard WSSS pipeline (seed, student, and EMA relabeling), CrispFormer consistently improves boundary F‑score, small‑object recall, and mIoU over SegFormer baselines trained on the same seeds, while adding minimal compute. Our decoder‑centric formulation is simple to implement, broadly compatible with existing SegFormer variants, and offers a reproducible path to higher‑fidelity masks from image‑level supervision.

Abstract:
Open‑vocabulary semantic segmentation (OVSS) underpins many vision and robotics tasks that require generalizable semantic understanding. Existing approaches either rely on limited segmentation training data, which hinders generalization, or apply zero‑shot heuristics to vision‑language models (e.g., CLIP), while the most competitive approaches combine multiple models to improve performance at the cost of high computational and memory demands. In this work, we leverage an overlooked agglomerative vision foundation model, RADIO, to improve zero‑shot OVSS along three key axes simultaneously: mIoU, latency, and parameter efficiency. We present the first comprehensive study of RADIO for zero‑shot OVSS and enhance its performance through self‑correlating recursive attention, self‑correlating global aggregation, and computationally efficient RADIO SAM mask refinement. Our approach, RADSeg, achieves 6‑30% mIoU improvement in the base ViT class while being 3.95x faster and using 2.5x fewer parameters. Surprisingly, RADSeg‑base (106M) outperforms previous combinations of huge vision models (850‑1350M) in mIoU, achieving state‑of‑the‑art accuracy with substantially lower computational and memory cost.

Abstract:
This thesis presents methods and datasets to investigate cartographic heritage on a large scale and from a cultural perspective. Heritage institutions worldwide have digitized more than one million maps, and automated techniques now enable large‑scale recognition and extraction of map content. Yet these methods have engaged little with the history of cartography, or the view that maps are semantic‑symbolic systems, and cultural objects reflecting political and epistemic expectations. This work leverages a diverse corpus of 771,561 map records and 99,715 digitized images aggregated from 38 digital catalogs. After normalization, the dataset includes 236,925 contributors and spans six centuries, from 1492 to 1948. These data make it possible to chart geographic structures and the global chronology of map publication. The spatial focus of cartography is analyzed in relation to political dynamics, evidencing links between Atlantic maritime charting, the triangular trade, and colonial expansion. Further results document the progression of national, domestic focus and the impact of military conflicts on publication volumes. The research introduces semantic segmentation techniques and object detection models for the generic recognition of land classes and cartographic signs, trained on annotated data and synthetic images. The analysis of land classes shows that maps are designed images whose framing and composition emphasize features through centering and semantic symmetries. The study of cartographic figuration encodes 63 M signs and 25 M fragments into a latent visual space, revealing figurative shifts such as the replacement of relief hachures by terrain contours and showing that signs tend to form locally consistent systems. Analyses of collaboration and diffusion highlight the role of legitimacy, larger actors, and major cities in the spread of figurative norms and semiotic cultures.

Abstract:
The rapid rise of large‑scale foundation models has reshaped the landscape of image segmentation, with models such as Segment Anything achieving unprecedented versatility across diverse vision tasks. However, previous generations, including SAM and its successor, still struggle with fine‑grained, low‑level segmentation challenges such as camouflaged object detection, medical image segmentation, cell image segmentation, and shadow detection. To address these limitations, we originally proposed SAM‑Adapter in 2023, demonstrating substantial gains on these difficult scenarios. With the emergence of Segment Anything 3 (SAM3), a more efficient and higher‑performing evolution with a redesigned architecture and improved training pipeline, we revisit these long‑standing challenges. In this work, we present SAM3‑Adapter, the first adapter framework tailored for SAM3 that unlocks its full segmentation capability. SAM3‑Adapter not only reduces computational overhead but also consistently surpasses both SAM and SAM2‑based solutions, establishing new state‑of‑the‑art results across multiple downstream tasks, including medical imaging, camouflaged (concealed) object segmentation, and shadow detection. Built upon the modular and composable design philosophy of the original SAM‑Adapter, SAM3‑Adapter provides stronger generalizability, richer task adaptability, and significantly improved segmentation precision. Extensive experiments confirm that integrating SAM3 with our adapter yields superior accuracy, robustness, and efficiency compared to all prior SAM‑based adaptations. We hope SAM3‑Adapter can serve as a foundation for future research and practical segmentation applications. Code, pre‑trained models, and data processing pipelines are available.

Abstract:
Diffusion‑based editing enables realistic modification of local image regions, making AI‑generated content harder to detect. Existing AIGC detection benchmarks focus on classifying entire images, overlooking the localization of diffusion‑based edits. We introduce DiffSeg30k, a publicly available dataset of 30k diffusion‑edited images with pixel‑level annotations, designed to support fine‑grained detection. DiffSeg30k features: 1) In‑the‑wild images: we collect images or image prompts from COCO to reflect real‑world content diversity; 2) Diverse diffusion models: local edits using eight SOTA diffusion models; 3) Multi‑turn editing: each image undergoes up to three sequential edits to mimic real‑world sequential editing; and 4) Realistic editing scenarios: a vision‑language model (VLM)‑based pipeline automatically identifies meaningful regions and generates context‑aware prompts covering additions, removals, and attribute changes. DiffSeg30k shifts AIGC detection from binary classification to semantic segmentation, enabling simultaneous localization of edits and identification of the editing models. We benchmark three baseline segmentation approaches, revealing significant challenges in semantic segmentation tasks, particularly concerning robustness to image distortions. Experiments also reveal that segmentation models, despite being trained for pixel‑level localization, emerge as highly reliable whole‑image classifiers of diffusion edits, outperforming established forgery classifiers while showing great potential in cross‑generator generalization. We believe DiffSeg30k will advance research in fine‑grained localization of AI‑generated content by demonstrating the promise and limitations of segmentation‑based methods. DiffSeg30k is released at: https://huggingface.co/datasets/Chaos2629/Diffseg30k

Abstract:
Existing autoregressive (AR) methods for generating artist‑designed meshes struggle to balance global structural consistency with high‑fidelity local details, and are susceptible to error accumulation. To address this, we propose PartDiffuser, a novel semi‑autoregressive diffusion framework for point‑cloud‑to‑mesh generation. The method first performs semantic segmentation on the mesh and then operates in a "part‑wise" manner: it employs autoregression between parts to ensure global topology, while utilizing a parallel discrete diffusion process within each semantic part to precisely reconstruct high‑frequency geometric features. PartDiffuser is based on the DiT architecture and introduces a part‑aware cross‑attention mechanism, using point clouds as hierarchical geometric conditioning to dynamically control the generation process, thereby effectively decoupling the global and local generation tasks. Experiments demonstrate that this method significantly outperforms state‑of‑the‑art (SOTA) models in generating 3D meshes with rich detail, exhibiting exceptional detail representation suitable for real‑world applications.

Abstract:
We introduce SegSplat, a novel framework designed to bridge the gap between rapid, feed‑forward 3D reconstruction and rich, open‑vocabulary semantic understanding. By constructing a compact semantic memory bank from multi‑view 2D foundation model features and predicting discrete semantic indices alongside geometric and appearance attributes for each 3D Gaussian in a single pass, SegSplat efficiently imbues scenes with queryable semantics. Our experiments demonstrate that SegSplat achieves geometric fidelity comparable to state‑of‑the‑art feed‑forward 3D Gaussian Splatting methods while simultaneously enabling robust open‑set semantic segmentation, crucially without requiring any per‑scene optimization for semantic feature integration. This work represents a significant step towards practical, on‑the‑fly generation of semantically aware 3D environments, vital for advancing robotic interaction, augmented reality, and other intelligent systems.

Abstract:
Deep unfolding networks (DUNs) have recently advanced concealed object segmentation (COS) by modeling segmentation as iterative foreground‑background separation. However, existing DUN‑based methods (RUN) inherently couple background estimation with image restoration, leading to conflicting objectives and requiring pre‑defined degradation types, which are unrealistic in real‑world scenarios. To address this, we propose the nested unfolding network (NUN), a unified framework for real‑world COS. NUN adopts a DUN‑in‑DUN design, embedding a degradation‑resistant unfolding network (DeRUN) within each stage of a segmentation‑oriented unfolding network (SODUN). This design decouples restoration from segmentation while allowing mutual refinement. Guided by a vision‑language model (VLM), DeRUN dynamically infers degradation semantics and restores high‑quality images without explicit priors, whereas SODUN performs reversible estimation to refine foreground and background. Leveraging the multi‑stage nature of unfolding, NUN employs image‑quality assessment to select the best DeRUN outputs for subsequent stages, naturally introducing a self‑consistency loss that enhances robustness. Extensive experiments show that NUN achieves a leading place on both clean and degraded benchmarks. Code will be released.

Abstract:
Existing methods for label‑deficient concealed object segmentation (LDCOS) either rely on consistency constraints or Segment Anything Model (SAM)‑based pseudo‑labeling. However, their performance remains limited due to the intrinsic concealment of targets and the scarcity of annotations. This study investigates two key questions: (1) Can consistency constraints and SAM‑based supervision be jointly integrated to better exploit complementary information and enhance the segmenter? and (2) beyond that, can the segmenter in turn guide SAM through reciprocal supervision, enabling mutual improvement? To answer these questions, we present SCALER, a unified collaborative framework toward LDCOS that jointly optimizes a mean‑teacher segmenter and a learnable SAM. SCALER operates in two alternating phases. In Phase I, the segmenter is optimized under fixed SAM supervision using entropy‑based image‑level and uncertainty‑based pixel‑level weighting to select reliable pseudo‑label regions and emphasize harder examples. In Phase II, SAM is updated via augmentation invariance and noise resistance losses, leveraging its inherent robustness to perturbations. Experiments demonstrate that SCALER yields consistent performance gains across eight semi‑ and weakly‑supervised COS tasks. The results further suggest that SCALER can serve as a general training paradigm to enhance both lightweight segmenters and large foundation models under label‑scarce conditions. Code will be released.
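
The entropy‑based weighting mentioned for the first phase can be made concrete with a small sketch. It is an illustration under assumptions, not SCALER's actual formulation: the pseudo‑label is taken to be a per‑pixel foreground probability map, reliability is defined as one minus the normalized binary entropy, and the image‑level weight is the mean of the pixel weights. All function names are hypothetical.

```python
import math


def binary_entropy(p, eps=1e-7):
    """Entropy (in nats) of a Bernoulli variable with foreground probability p."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def pixel_weights(prob_map):
    """Per-pixel reliability in [0, 1]: 1 minus entropy normalised by log(2)."""
    max_entropy = math.log(2.0)
    return [[1.0 - binary_entropy(p) / max_entropy for p in row] for row in prob_map]


def image_weight(prob_map):
    """Image-level reliability: mean of the per-pixel weights (an assumption)."""
    flat = [w for row in pixel_weights(prob_map) for w in row]
    return sum(flat) / len(flat)


# A confident pseudo-label map versus a maximally uncertain one.
confident = [[0.99, 0.01], [0.98, 0.02]]
uncertain = [[0.5, 0.5], [0.5, 0.5]]
w_confident = image_weight(confident)
w_uncertain = image_weight(uncertain)
```

Under this definition a confident pseudo‑label map contributes more to the loss than an ambiguous one. The abstract additionally describes emphasizing harder examples, which this sketch does not model.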

Abstract:
Modern transformer architectures achieve remarkable performance across tasks and domains but remain rigid in how they allocate computation at inference time. Real‑world deployment often requires models to adapt to diverse hardware and latency constraints, yet most approaches to dynamic computation focus on a single axis, such as reducing the number of tokens. We present a novel capability: AdaPerceiver, the first transformer architecture with unified adaptivity across depth, width, and tokens within a single model. We propose an architecture that supports adaptivity along these axes. We couple this with an efficient joint training regime that ensures the model maintains performance across its various configurations. We evaluate AdaPerceiver on image classification, semantic segmentation, and depth estimation tasks. On image classification, AdaPerceiver expands the accuracy‑throughput Pareto front. It achieves 85.4% accuracy while yielding 36% higher throughput than FlexiViT‑L. On dense prediction, AdaPerceiver matches ViT‑H/14 while having ~26x fewer encoder FLOPs (floating‑point operations) on semantic segmentation and depth estimation. Finally, we show how AdaPerceiver equipped with a policy can maintain ImageNet1K accuracy ($\pm$0.1 percentage points) while reducing FLOPs by 24‑33%.

Abstract:
Accurate BEV semantic segmentation from fisheye imagery remains challenging due to extreme non‑linear distortion, occlusion, and depth ambiguity inherent to wide‑angle projections. We present a distortion‑aware BEV segmentation framework that directly processes multi‑camera high‑resolution fisheye images, utilizing calibrated geometric unprojection and per‑pixel depth distribution estimation. Each image pixel is lifted into 3D space via Gaussian parameterization, predicting spatial means and anisotropic covariances to explicitly model geometric uncertainty. The projected 3D Gaussians are fused into a BEV representation via differentiable splatting, producing continuous, uncertainty‑aware semantic maps without requiring undistortion or perspective rectification. Extensive experiments demonstrate strong segmentation performance on complex parking and urban driving scenarios, achieving IoU scores of 87.75% for drivable regions and 57.26% for vehicles under severe fisheye distortion and diverse environmental conditions.

Abstract:
Scalable Vector Graphics are a standard representation for editable visual design, yet they are usually authored as single view two dimensional illustrations. This limits their use in applications that require object level assets to remain coherent when observed, edited, or animated from different viewpoints. We present SVG360, a framework that converts a single input SVG into geometrically and visually consistent multiview SVG assets. The key challenge is that direct per view generation or vectorization produces view dependent regions, fragmented paths, and unstable colors, making the resulting SVGs difficult to edit as a coherent object. SVG360 addresses this problem through a view consistent vectorization pipeline. It first lifts the rasterized input into a view conditioned object representation and renders target views under prescribed cameras. It then propagates part identity across neighboring views using a spatial memory mechanism adapted from video segmentation, establishing consistent region decomposition, path correspondence, and color assignment without task specific retraining. Finally, each view is reconstructed as an editable SVG through structure aware vectorization, where redundant paths are consolidated and local geometry is optimized while preserving boundaries and semantic parts. Experiments on object level SVG assets show that SVG360 improves multiview consistency, reduces path redundancy, and better preserves fine structures compared with direct per view vectorization. By turning a single view SVG into a coherent 360 degree vector asset, SVG360 expands vector graphics from static illustration toward editable multiview content for design, animation, and structured visual editing.

Abstract:
3D hierarchical semantic segmentation (3DHS) is crucial for embodied intelligence applications that demand a multi‑grained and multi‑hierarchy understanding of 3D scenes. Despite this progress, previous 3DHS methods have overlooked the following two challenges: I) multi‑label learning with a parameter‑sharing model can lead to multi‑hierarchy conflicts in cross‑hierarchy optimization, and II) class imbalance is inevitable across multiple hierarchies of 3D scenes, which causes model performance to be dominated by major classes. To address these issues, we propose a novel framework with a primary 3DHS branch and an auxiliary discrimination branch. Specifically, to alleviate the multi‑hierarchy conflicts, we propose a late‑decoupled 3DHS framework which employs multiple decoders with coarse‑to‑fine hierarchical guidance and consistency. The late‑decoupled architecture can mitigate the underfitting and overfitting conflicts among multiple hierarchies and can also constrain the class imbalance problem in each individual hierarchy. Moreover, we introduce a 3DHS‑oriented semantic prototype based bi‑branch supervision mechanism, which additionally learns class‑wise discriminative point cloud features and performs mutual supervision between the auxiliary and 3DHS branches, to enhance class‑imbalanced segmentation. Extensive experiments on multiple datasets and backbones demonstrate that our approach achieves state‑of‑the‑art 3DHS performance, and its core components can also be used as a plug‑and‑play enhancement to improve previous methods.

Abstract:
The study of terrain and landform classification through UAV remote sensing diverges significantly from ground vehicle patrol tasks. Besides grappling with the complexity of data annotation and ensuring temporal consistency, it also confronts the scarcity of relevant data and the limitations imposed by the effective range of many technologies. This research substantiates that, in aerial positioning tasks, both the mean Intersection over Union (mIoU) and temporal consistency (TC) metrics are of paramount importance. It is demonstrated that fully labeled data is not the optimal choice, as selecting only key data lacks the enhancement in TC, leading to failures. Hence, a teacher‑student architecture, coupled with key frame selection and key frame updating algorithms, is proposed. This framework successfully performs weakly supervised learning and TC knowledge distillation, overcoming the deficiencies of traditional TC training in aerial tasks. The experimental results reveal that our method, utilizing merely 30% of labeled data, concurrently elevates mIoU and temporal consistency, ensuring stable localization of terrain objects. Result demo: https://gitlab.com/prophet.ai.inc/drone‑based‑riverbed‑inspection
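
For reference, the two metrics named in this abstract can be sketched as follows. The mIoU function follows the standard definition (per‑class intersection over union, averaged over the classes that occur). The temporal‑consistency function is a deliberately simplified stand‑in that compares consecutive predictions directly, assuming a static camera; TC is commonly computed after warping the previous prediction with optical flow, and the exact protocol used by the authors is not stated here. Function names are hypothetical.

```python
def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union for two flat lists of integer class labels.

    Classes that appear in neither list are skipped so that they do not
    distort the average.
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


def temporal_consistency(prev_pred, curr_pred, num_classes):
    """Simplified TC: mIoU between predictions on consecutive frames.

    Assumes no camera or scene motion; a real pipeline would first warp
    prev_pred into the current frame (for example with optical flow).
    """
    return mean_iou(prev_pred, curr_pred, num_classes)


# Four pixels, two classes: class 0 has IoU 1/2, class 1 has IoU 2/3.
pred = [0, 0, 1, 1]
target = [0, 1, 1, 1]
score = mean_iou(pred, target, num_classes=2)
tc = temporal_consistency(pred, pred, num_classes=2)
```

The two numbers answer different questions: mIoU measures agreement with ground truth on labeled frames, whereas TC measures agreement between the model's own predictions over time and therefore needs no labels.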

Abstract:
Open‑vocabulary semantic segmentation (OVSS) employs pixel‑level vision‑language alignment to associate category‑related prompts with corresponding pixels. A key challenge is enhancing the multimodal dense prediction capability, specifically this pixel‑level multimodal alignment. Although existing methods achieve promising results by leveraging CLIP's vision‑language alignment, they rarely investigate the performance boundaries of CLIP for dense prediction from an interpretability mechanisms perspective. In this work, we systematically investigate CLIP's internal mechanisms and identify a critical phenomenon: analogous to human distraction, CLIP diverts significant attention resources from target regions to irrelevant tokens. Our analysis reveals that these tokens arise from dimension‑specific over‑activation; filtering them enhances CLIP's dense prediction performance. Consequently, we propose ReFocusing CLIP (RF‑CLIP), a training‑free approach that emulates human distraction‑refocusing behavior to redirect attention from distraction tokens back to target regions, thereby refining CLIP's multimodal alignment granularity. Our method achieves SOTA performance on eight benchmarks while maintaining high inference efficiency.

Abstract:
Recently, the strong generalization ability of CLIP has facilitated open‑vocabulary semantic segmentation, which labels pixels using arbitrary text. However, existing methods that fine‑tune CLIP for segmentation on limited seen categories often lead to overfitting and degrade the pretrained vision‑language alignment. To stabilize modality alignment during fine‑tuning, we propose InfoCLIP, which leverages an information‑theoretic perspective to transfer alignment knowledge from pretrained CLIP to the segmentation task. Specifically, this transfer is guided by two novel objectives grounded in mutual information. First, we compress the pixel‑text modality alignment from pretrained CLIP to reduce noise arising from its coarse‑grained local semantic representations learned under image‑text supervision. Second, we maximize the mutual information between the alignment knowledge of pretrained CLIP and the fine‑tuned model to transfer compact local semantic relations suited for the segmentation task. Extensive evaluations across various benchmarks validate the effectiveness of InfoCLIP in enhancing CLIP fine‑tuning for open‑vocabulary semantic segmentation, demonstrating its adaptability and superiority in asymmetric transfer.

Abstract:
The automated analysis of historical documents, particularly maps, has drastically benefited from advances in deep learning and its success across various computer vision applications. However, most deep learning‑based methods heavily rely on large amounts of annotated training data, which are typically unavailable for historical maps, especially for those belonging to specific, homogeneous cartographic domains, also known as corpora. Creating high‑quality training data suitable for machine learning often takes a significant amount of time and involves extensive manual effort. While synthetic training data can alleviate the scarcity of real‑world samples, it often lacks the affinity (realism) and diversity (variation) necessary for effective learning. By transferring the cartographic style of a historical map corpus onto modern vector data, we bootstrap an effectively unlimited number of synthetic historical maps suitable for tasks such as land‑cover interpretation of a homogeneous historical map corpus. We propose an automatic deep generative approach and an alternative manual stochastic degradation technique to emulate the visual uncertainty and noise, also known as aleatoric uncertainty, commonly observed in historical map scans. To quantitatively evaluate the effectiveness and applicability of our approach, the bootstrapped training datasets were employed for domain‑adaptive semantic segmentation on a homogeneous map corpus using a Self‑Constructing Graph Convolutional Network, enabling a comprehensive assessment of the impact of our data bootstrapping methods.

Abstract:
Geospatial Foundation Models (GeoFMs) are transforming Earth Observation (EO), but evaluation lacks standardized protocols. GEO‑Bench‑2 addresses this with a comprehensive framework spanning classification, segmentation, regression, object detection, and instance segmentation across 19 permissively‑licensed datasets. We introduce "capability" groups to rank models on datasets that share common characteristics (e.g., resolution, bands, temporality). This enables users to identify which models excel in each capability and determine which areas need improvement in future work. To support both fair comparison and methodological innovation, we define a prescriptive yet flexible evaluation protocol. This not only ensures consistency in benchmarking but also facilitates research into model adaptation strategies, a key and open challenge in advancing GeoFMs for downstream tasks. Our experiments show that no single model dominates across all tasks, confirming the specificity of the choices made during architecture design and pretraining. While models pretrained on natural images (ConvNext ImageNet, DINO V3) excel on high‑resolution tasks, EO‑specific models (TerraMind, Prithvi, and Clay) outperform them on multispectral applications such as agriculture and disaster response. These findings demonstrate that optimal model choice depends on task requirements, data modalities, and constraints. This shows that the goal of a single GeoFM model that performs well across all tasks remains open for future research. GEO‑Bench‑2 enables informed, reproducible GeoFM evaluation tailored to specific use cases. Code, data, and leaderboard for GEO‑Bench‑2 are publicly released under a permissive license.

Abstract:
Recent CLIP‑based few‑shot semantic segmentation methods introduce class‑level textual priors to assist segmentation by typically using a single prompt (e.g., a photo of class). However, these approaches often result in incomplete activation of target regions, as a single textual description cannot fully capture the semantic diversity of complex categories. Moreover, they lack explicit cross‑modal interaction and are vulnerable to noisy support features, further degrading visual prior quality. To address these issues, we propose the Multi‑Text Guided Few‑Shot Semantic Segmentation Network (MTGNet), a dual‑branch framework that enhances segmentation performance by fusing diverse textual prompts to refine textual priors and guide the cross‑modal optimization of visual priors. Specifically, we design a Multi‑Textual Prior Refinement (MTPR) module that suppresses interference and aggregates complementary semantic cues to enhance foreground activation and expand semantic coverage for structurally complex objects. We introduce a Text Anchor Feature Fusion (TAFF) module, which leverages multi‑text embeddings as semantic anchors to facilitate the transfer of discriminative local prototypes from support images to query images, thereby improving semantic consistency and alleviating intra‑class variations. Furthermore, a Foreground Confidence‑Weighted Attention (FCWA) module is presented to enhance visual prior robustness by leveraging internal self‑similarity within support foreground features. It adaptively down‑weights inconsistent regions and effectively suppresses interference in the query segmentation process. Extensive experiments on standard FSS benchmarks validate the effectiveness of MTGNet. In the 1‑shot setting, it achieves 76.8% mIoU on PASCAL‑5i and 57.4% on COCO‑20i, with notable improvements in folds exhibiting high intra‑class variations.

Abstract:
We introduce WarNav, a novel real‑world dataset constructed from images of the open‑source DATTALION repository, specifically tailored to enable the development and benchmarking of semantic segmentation models for autonomous ground vehicle navigation in unstructured, conflict‑affected environments. This dataset addresses a critical gap between conventional urban driving resources and the unique operational scenarios encountered by unmanned systems in hazardous and damaged war‑zones. We detail the methodological challenges encountered, ranging from data heterogeneity to ethical considerations, providing guidance for future efforts that target extreme operational contexts. To establish performance references, we report baseline results on WarNav using several state‑of‑the‑art semantic segmentation models trained on structured urban scenes. We further analyse the impact of training data environments and propose a first step towards effective navigability in challenging environments with the constraint of having no annotation of the targeted images. Our goal is to foster impactful research that enhances the robustness and safety of autonomous vehicles in high‑risk scenarios while being frugal in annotated data.

Abstract:
Transparent object perception remains a major challenge in computer vision research, as transparency confounds both depth estimation and semantic segmentation. Recent work has explored multi‑task learning frameworks to improve robustness, yet negative cross‑task interactions often hinder performance. In this work, we introduce Edge‑Guided Spatial Attention (EGSA), a fusion mechanism designed to mitigate destructive interactions by incorporating boundary information into the fusion between semantic and geometric features. On both Syn‑TODD and ClearPose benchmarks, EGSA consistently improved depth accuracy over the current state‑of‑the‑art method (MODEST), while preserving competitive segmentation performance, with the largest improvements appearing in transparent regions. Besides our fusion design, our second contribution is a multi‑modal progressive training strategy, where learning transitions from edges derived from RGB images to edges derived from predicted depth images. This approach allows the system to bootstrap learning from the rich textures contained in RGB images, and then switch to more relevant geometric content in depth maps, while eliminating the need for ground‑truth depth at training time. Together, these contributions highlight edge‑guided fusion as a robust approach capable of improving transparent object perception.

Abstract:
Delineating farm boundaries through segmentation of satellite images is a fundamental step in many agricultural applications. The task is particularly challenging for smallholder farms, where accurate delineation requires the use of high resolution (HR) imagery, which is available only at low revisit frequencies (e.g., annually). To support more frequent (sub‑)seasonal monitoring, HR images could be combined as references (ref) with low resolution (LR) images, which have a higher revisit frequency (e.g., weekly), using reference‑based super‑resolution (Ref‑SR) methods. However, current Ref‑SR methods optimize perceptual quality and smooth over crucial features needed for downstream tasks, and are unable to meet the large scale‑factor requirements for this task. Further, previous two‑step approaches of SR followed by segmentation do not effectively utilize diverse satellite sources as inputs. We address these problems through a new approach, SEED‑SR, which uses a combination of conditional latent diffusion models and large‑scale multi‑spectral, multi‑source geo‑spatial foundation models. Our key innovation is to bypass the explicit SR task in the pixel space and instead perform SR in a segmentation‑aware latent space. This unique approach enables us to generate segmentation maps at an unprecedented 20× scale factor, and rigorous experiments on two large, real datasets demonstrate up to 25.5 and 12.9 relative improvement in instance and semantic segmentation metrics respectively over approaches based on state‑of‑the‑art Ref‑SR methods.

Abstract:
Scaling up network depth is a fundamental pursuit in neural architecture design, as theory suggests that deeper models offer exponentially greater capability. Benefiting from the residual connections, modern neural networks can scale up to more than one hundred layers and enjoy wide success. However, as networks continue to deepen, current architectures often struggle to realize their theoretical capacity improvements, calling for more advanced designs to further unleash the potential of deeper networks. In this paper, we identify two key barriers that obstruct residual models from scaling deeper: shortcut degradation and limited width. Shortcut degradation hinders deep‑layer learning, while the inherent depth‑width trade‑off imposes limited width. To mitigate these issues, we propose a generalized residual architecture dubbed Step by Step Network (StepsNet) to bridge the gap between theoretical potential and practical performance of deep models. Specifically, we separate features along the channel dimension and let the model learn progressively via stacking blocks with increasing width. The resulting method mitigates the two identified problems and serves as a versatile macro design applicable to various models. Extensive experiments show that our method consistently outperforms residual models across diverse tasks, including image classification, object detection, semantic segmentation, and language modeling. These results position StepsNet as a superior generalization of the widely adopted residual architecture.
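
The progressive, channel‑wise scheme described in this abstract can be pictured with a toy sketch. It reflects one reading of the text rather than the StepsNet implementation: the input channels are split into slices, the first slice passes through the most blocks, and each later step widens the active feature by concatenating the next slice. The residual block is a fixed element‑wise function standing in for a learned layer, and all names and parameters are hypothetical.

```python
import math


def residual_block(x, scale=0.1):
    """Toy residual block: x + scale * tanh(x), standing in for a learned layer."""
    return [v + scale * math.tanh(v) for v in x]


def stepwise_forward(x, widths, blocks_per_step=2):
    """Process channels progressively, starting narrow and widening at each step.

    x:      flat list of channel values for a single position
    widths: increasing channel counts, the last one equal to len(x)
    """
    assert widths == sorted(widths) and widths[-1] == len(x)
    active = []
    start = 0
    for width in widths:
        # Bring in the next slice of channels that has not been processed yet.
        active = active + x[start:width]
        start = width
        for _ in range(blocks_per_step):
            active = residual_block(active)
    return active


# Four channels split into two steps of width 2 and 4:
# channels 0-1 pass through 4 blocks, channels 2-3 through only 2.
x = [1.0, 1.0, 1.0, 1.0]
out = stepwise_forward(x, widths=[2, 4])
```

In this toy setup the early channels receive more depth at a small width, and full width is used only in the final step, which is one way a depth‑width trade‑off could be relaxed.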

Abstract:
This work focuses on multi‑shot semi‑supervised video object segmentation (MVOS), which aims at segmenting the target object indicated by an initial mask throughout a video with multiple shots. The existing VOS methods mainly focus on single‑shot videos and struggle with shot discontinuities, thereby limiting their real‑world applicability. We propose a transition mimicking data augmentation strategy (TMA) which enables cross‑shot generalization with single‑shot data to alleviate the severe annotated multi‑shot data sparsity, and the Segment Anything Across Shots (SAAS) model, which can detect and comprehend shot transitions effectively. To support evaluation and future study in MVOS, we introduce Cut‑VOS, a new MVOS benchmark with dense mask annotations, diverse object categories, and high‑frequency transitions. Extensive experiments on YouMVOS and Cut‑VOS demonstrate that the proposed SAAS achieves state‑of‑the‑art performance by effectively mimicking, understanding, and segmenting across complex transitions. The code and datasets are released at https://henghuiding.com/SAAS/.

Abstract:
The Segment Anything Model (SAM) family has become a widely adopted vision foundation model, but its ability to control segmentation granularity remains limited. Users often need to refine results manually (by adding more prompts or selecting from pre‑generated masks) to achieve the desired level of detail. This process can be ambiguous, as the same prompt may correspond to several plausible masks, and collecting dense annotations across all granularities is prohibitively expensive, making supervised solutions infeasible. To address this limitation, we introduce UnSAMv2, which enables segment anything at any granularity without human annotations. UnSAMv2 extends the divide‑and‑conquer strategy of UnSAM by discovering abundant mask‑granularity pairs and introducing a novel granularity control embedding that enables precise, continuous control over segmentation scale. Remarkably, with only 6K unlabeled images and 0.02% additional parameters, UnSAMv2 substantially enhances SAM‑2, achieving segment anything at any granularity across interactive, whole‑image, and video segmentation tasks. Evaluated on over 11 benchmarks, UnSAMv2 improves $\text{NoC}_{90}$ (5.69 $\rightarrow$ 4.75), 1‑IoU (58.0 $\rightarrow$ 73.1), and $\text{AR}_{1000}$ (49.6 $\rightarrow$ 68.3), showing that small amounts of unlabeled data with a granularity‑aware self‑supervised learning method can unlock the potential of vision foundation models.

Abstract:
We introduce GS‑Light, an efficient, textual position‑aware pipeline for text‑guided relighting of 3D scenes represented via Gaussian Splatting (3DGS). GS‑Light implements a training‑free extension of a single‑input diffusion model to handle multi‑view inputs. Given a user prompt that may specify lighting direction, color, intensity, or reference objects, we employ a large vision‑language model (LVLM) to parse the prompt into lighting priors. Using off‑the‑shelf estimators for geometry and semantics (depth, surface normals, and semantic segmentation), we fuse these lighting priors with view‑geometry constraints to compute illumination maps and generate initial latent codes for each view. These meticulously derived init latents guide the diffusion model to generate relighting outputs that more accurately reflect user expectations, especially in terms of lighting direction. By feeding multi‑view rendered images, along with the init latents, into our multi‑view relighting model, we produce high‑fidelity, artistically relit images. Finally, we fine‑tune the 3DGS scene with the relit appearance to obtain a fully relit 3D scene. We evaluate GS‑Light on both indoor and outdoor scenes, comparing it to state‑of‑the‑art baselines including per‑view relighting, video relighting, and scene editing methods. Using quantitative metrics (multi‑view consistency, imaging quality, aesthetic score, semantic similarity, etc.) and qualitative assessment (user studies), GS‑Light demonstrates consistent improvements over baselines. Code and assets will be made available upon publication.

Abstract:
Urban villages (UVs), informal settlements embedded within China's urban fabric, have undergone widespread demolition and redevelopment in recent decades. However, there remains a lack of systematic evaluation of whether the demolished land has been effectively reused, raising concerns about the efficacy and sustainability of current redevelopment practices. To address the gap, this study proposes a deep learning‑based framework to monitor the spatiotemporal changes of UVs in China. Specifically, semantic segmentation of multi‑temporal remote sensing imagery is first used to map evolving UV boundaries, and then post‑demolition land use is classified into six categories based on the "remained‑demolished‑redeveloped" phase: incomplete demolition, vacant land, construction sites, buildings, green spaces, and others. Four representative cities from China's four economic regions were selected as the study areas, i.e., Guangzhou (East), Zhengzhou (Central), Xi'an (West), and Harbin (Northeast). The results indicate: 1) UV redevelopment processes were frequently prolonged; 2) redevelopment transitions primarily occurred in peripheral areas, whereas urban cores remained relatively stable; and 3) three spatiotemporal transformation pathways, i.e., synchronized redevelopment, delayed redevelopment, and gradual optimization, were revealed. This study highlights the fragmented, complex and nonlinear nature of UV redevelopment, underscoring the need for tiered and context‑sensitive planning strategies. By linking spatial dynamics with the context of redevelopment policies, the findings offer valuable empirical insights that support more inclusive, efficient, and sustainable urban renewal, while also contributing to a broader global understanding of informal settlement transformations.

Abstract:
Embodied world models aim to predict and interact with the physical world through visual observations and actions. However, existing models struggle to accurately translate low‑level actions (e.g., joint positions) into precise robotic movements in predicted frames, leading to inconsistencies with real‑world physical interactions. To address these limitations, we propose MTV‑World, an embodied world model that introduces Multi‑view Trajectory‑Video control for precise visuomotor prediction. Specifically, instead of directly using low‑level actions for control, we employ trajectory videos obtained through camera intrinsic and extrinsic parameters and Cartesian‑space transformation as control signals. However, projecting 3D raw actions onto 2D images inevitably causes a loss of spatial information, making a single view insufficient for accurate interaction modeling. To overcome this, we introduce a multi‑view framework that compensates for the loss of spatial information and ensures high consistency with the physical world. MTV‑World forecasts future frames from multi‑view trajectory videos, conditioned on an initial frame per view. Furthermore, to systematically evaluate both robotic motion precision and object interaction accuracy, we develop an auto‑evaluation pipeline leveraging multimodal large models and referring video object segmentation models. To measure spatial consistency, we formulate it as an object location matching problem and adopt the Jaccard Index as the evaluation metric. Extensive experiments demonstrate that MTV‑World achieves precise control execution and accurate physical interaction modeling in complex dual‑arm scenarios.
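
As an illustration of the metric named in this abstract, the Jaccard Index between a predicted and a reference object mask can be sketched as follows. This is a minimal sketch and not the authors' evaluation pipeline: the function name `jaccard_index` and the representation of a mask as a set of (row, col) pixel coordinates are assumptions made for illustration.

```python
def jaccard_index(pred_pixels, ref_pixels):
    """Jaccard Index (intersection over union) of two sets of pixel coordinates."""
    pred, ref = set(pred_pixels), set(ref_pixels)
    union = pred | ref
    if not union:
        # Convention assumed here: two empty masks count as a perfect match.
        return 1.0
    return len(pred & ref) / len(union)


# Hypothetical 2x2 object predicted one column to the left of the reference.
pred = {(0, 0), (0, 1), (1, 0), (1, 1)}
ref = {(0, 1), (1, 1), (0, 2), (1, 2)}
score = jaccard_index(pred, ref)  # 2 shared pixels out of 6 in the union
```

A score of 1 means the predicted object location coincides with the reference, and 0 means the two masks do not overlap at all.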

Abstract:
Reasoning segmentation enables open‑set object segmentation via implicit text queries, thereby serving as a foundation for embodied agents that should operate autonomously in real‑world environments. However, existing methods for reasoning segmentation require multimodal large language models with billions of parameters, which exceed the computational capabilities of the edge devices on which embodied AI systems are typically deployed. Distillation offers a pathway to compress these models while preserving their capabilities. Yet, existing distillation approaches fail to transfer the multi‑step reasoning capabilities that reasoning segmentation demands, as they focus on matching output predictions and intermediate features rather than preserving reasoning chains. The emerging paradigm of reasoning over digital twin representations presents an opportunity for more effective distillation by re‑framing the problem. Consequently, we propose FastReasonSeg, which employs digital twin representations that decouple perception from reasoning to enable more effective distillation. Our distillation scheme first relies on supervised fine‑tuning on teacher‑generated reasoning chains. This is followed by reinforcement fine‑tuning with joint rewards evaluating both segmentation accuracy and reasoning quality alignment. Experiments on two video (JiTBench, RVTBench) and two image benchmarks (ReasonSeg, LLM‑Seg40K) demonstrate that our FastReasonSeg achieves state‑of‑the‑art reasoning segmentation performance. Moreover, the distilled 0.6B variant outperforms models with 20 times more parameters while achieving 7.79 FPS throughput with only 2.1GB memory consumption. This efficiency makes real‑time reasoning segmentation deployable in resource‑constrained environments.

Abstract:
Medical vision‑language pre‑training (VLP) offers significant potential for advancing medical image understanding by leveraging paired image‑report data. However, existing methods are limited by False Negatives (FaNe) induced by semantically similar texts and insufficient fine‑grained cross‑modal alignment. To address these limitations, we propose FaNe, a semantic‑enhanced VLP framework. To mitigate false negatives, we introduce a semantic‑aware positive pair mining strategy based on text‑text similarity with adaptive normalization. Furthermore, we design a text‑conditioned sparse attention pooling module to enable fine‑grained image‑text alignment through localized visual representations guided by textual cues. To strengthen intra‑modal discrimination, we develop a hard‑negative aware contrastive loss that adaptively reweights semantically similar negatives. Extensive experiments on five downstream medical imaging benchmarks demonstrate that FaNe achieves state‑of‑the‑art performance across image classification, object detection, and semantic segmentation, validating the effectiveness of our framework.

Abstract:
This study presents a comparative analysis of three U‑Net‑based architectures for semantic segmentation of rock art petroglyphs from Brazilian archaeological sites. The investigated architectures were: (1) BEGL‑UNet with Border‑Enhanced Gaussian Loss function; (2) Attention‑Residual BEGL‑UNet, incorporating residual blocks and gated attention mechanisms; and (3) Spatial Channel Attention BEGL‑UNet, which employs spatial‑channel attention modules based on Convolutional Block Attention Module. All implementations employed the BEGL loss function combining binary cross‑entropy with Gaussian edge enhancement. Experiments were conducted on images from the Poço da Bebidinha Archaeological Complex, Piauí, Brazil, using 5‑fold cross‑validation. Among the architectures, Attention‑Residual BEGL‑UNet achieved the best overall performance with Dice Score of 0.710, validation loss of 0.067, and highest recall of 0.854. Spatial Channel Attention BEGL‑UNet obtained comparable performance with DSC of 0.707 and recall of 0.857. The baseline BEGL‑UNet registered DSC of 0.690. These results demonstrate the effectiveness of attention mechanisms for archaeological heritage digital preservation, with Dice Score improvements of 2.5‑2.9% over the baseline.
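
The reported improvement range can be checked by simple arithmetic on the Dice scores quoted in this abstract. The sketch below is illustrative only: the helper names `dice_score` and `relative_gain_percent` are hypothetical, masks are assumed to be flat lists of 0/1 values, and reading the 2.5‑2.9% figure as a relative gain over the baseline is an assumption, although it is the reading that matches the quoted numbers.

```python
def dice_score(pred, target):
    """Dice coefficient 2*|A and B| / (|A| + |B|) for flat binary masks (lists of 0/1)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def relative_gain_percent(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


example_dice = dice_score([1, 1, 0, 0], [1, 0, 1, 0])  # one shared pixel, two per mask

baseline_dsc = 0.690  # BEGL-UNet, as quoted in the abstract
gain_attention_residual = relative_gain_percent(0.710, baseline_dsc)  # about 2.9
gain_spatial_channel = relative_gain_percent(0.707, baseline_dsc)     # about 2.5
```

Under this reading, the absolute gains are 0.020 and 0.017 Dice points respectively.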

Abstract:
Low‑altitude Unmanned Aerial Vehicle (UAV) networks rely on robust semantic segmentation as a foundational enabler for distributed sensing‑communication‑control co‑design across heterogeneous agents within the network. However, segmentation foundation models deteriorate quickly under weather, lighting, and viewpoint drift. Resource‑limited UAVs cannot run gradient‑based test‑time adaptation, while resource‑massive UAVs adapt independently, wasting shared experience. To address these challenges, we propose AdaptFly, a prompt‑guided test‑time adaptation framework that adjusts segmentation models without weight updates. AdaptFly features two complementary adaptation modes. For resource‑limited UAVs, it employs lightweight token‑prompt retrieval from a shared global memory. For resource‑massive UAVs, it uses gradient‑free sparse visual prompt optimization via Covariance Matrix Adaptation Evolution Strategy. An activation‑statistic detector triggers adaptation, while a cross‑UAV knowledge pool consolidates prompt knowledge and enables fleet‑wide collaboration with negligible bandwidth overhead. Extensive experiments on the UAVid and VDD benchmarks, along with real‑world UAV deployments under diverse weather conditions, demonstrate that AdaptFly significantly improves segmentation accuracy and robustness over static models and state‑of‑the‑art TTA baselines. The results highlight a practical path to resilient, communication‑efficient perception in the emerging low‑altitude economy.

Abstract:
Reinforcement learning (RL) in 3D environments with high‑dimensional sensory input poses two major challenges: (1) the high memory consumption induced by memory buffers required to stabilise learning, and (2) the complexity of learning in partially observable Markov Decision Processes (POMDPs). This project addresses these challenges by proposing two novel input representations: SS‑only and RGB+SS, both employing semantic segmentation on RGB colour images. Experiments were conducted in deathmatches of ViZDoom, utilizing perfect segmentation results for controlled evaluation. Our results showed that SS‑only was able to reduce the memory consumption of memory buffers by at least 66.6%, and up to 98.6% when a vectorisable lossless compression technique with minimal overhead such as run‑length encoding is applied. Meanwhile, RGB+SS significantly enhances RL agents' performance with the additional semantic information provided. Furthermore, we explored density‑based heatmapping as a tool to visualise RL agents' movement patterns and evaluate their suitability for data collection. A brief comparison with a previous approach highlights how our method overcame common pitfalls in applying semantic segmentation in 3D environments like ViZDoom.
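
The two memory figures in this abstract can be made concrete with a small sketch. Storing one class label per pixel instead of three colour channels saves two thirds of the buffer if each value takes the same number of bytes, which is consistent with the quoted 66.6%; run‑length encoding then compresses long runs of identical labels losslessly. The code below is a plain‑Python illustration with hypothetical names (`rle_encode`, `rle_decode`); it is not vectorised and is not the authors' implementation.

```python
def rle_encode(labels):
    """Run-length encode a flat sequence of class labels into (label, run_length) pairs."""
    runs = []
    for label in labels:
        if runs and runs[-1][0] == label:
            runs[-1][1] += 1  # extend the current run
        else:
            runs.append([label, 1])  # start a new run
    return [(label, count) for label, count in runs]


def rle_decode(runs):
    """Invert rle_encode, recovering the original label sequence exactly."""
    decoded = []
    for label, count in runs:
        decoded.extend([label] * count)
    return decoded


# Hypothetical row of a segmentation map with three classes (0, 1, 2).
row = [0, 0, 0, 0, 2, 2, 2, 1, 1, 0, 0, 0]
encoded = rle_encode(row)
restored = rle_decode(encoded)

# One label per pixel versus three colour values per pixel (same bytes per value assumed).
rgb_values = 3 * len(row)
ss_values = len(row)
saving = 1 - ss_values / rgb_values
```

How much run‑length encoding saves beyond the two‑thirds figure depends on how long the runs are, so the upper bound quoted in the abstract is specific to the frames the authors measured.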

Abstract:
Recent approaches for few‑shot 3D point cloud semantic segmentation typically require a two‑stage learning process, i.e., a pre‑training stage followed by a few‑shot training stage. While effective, these methods face overreliance on pre‑training, which hinders model flexibility and adaptability. Some models tried to avoid pre‑training yet failed to capture ample information. In addition, current approaches focus on visual information in the support set and neglect or do not fully exploit other useful data, such as textual annotations. This inadequate utilization of support information impairs the performance of the model and restricts its zero‑shot ability. To address these limitations, we present a novel pre‑training‑free network, named Efficient Point Cloud Semantic Segmentation for Few‑ and Zero‑shot scenarios (EPSegFZ). EPSegFZ incorporates three key components: a Prototype‑Enhanced Registers Attention (ProERA) module and a Dual Relative Positional Encoding (DRPE)‑based cross‑attention mechanism, which together improve feature extraction and build accurate query‑prototype correspondences without pre‑training, and a Language‑Guided Prototype Embedding (LGPE) module, which leverages textual information from the support set to improve few‑shot performance and enable zero‑shot inference. Extensive experiments show that our method outperforms the state‑of‑the‑art method by 5.68% and 3.82% on the S3DIS and ScanNet benchmarks, respectively.

Abstract:
Successful autonomous robot navigation in off‑road domains requires the ability to generate high‑quality terrain costmaps that are able to both generalize well over a wide variety of terrains and rapidly adapt relative costs at test time to meet mission‑specific needs. Existing approaches for costmap generation allow for either rapid test‑time adaptation of relative costs (e.g., semantic segmentation methods) or generalization to new terrain types (e.g., representation learning methods), but not both. In this work, we present scaled preference conditioned all‑terrain costmap generation (SPACER), a novel approach for generating terrain costmaps that leverages synthetic data during training in order to generalize well to new terrains, and allows for rapid test‑time adaptation of relative costs by conditioning on a user‑specified scaled preference context. Using large‑scale aerial maps, we provide empirical evidence that SPACER outperforms other approaches at generating costmaps for terrain navigation, with the lowest measured regret across varied preferences in five of seven environments for global path planning.

Abstract:
Histopathologists establish cancer grade by assessing histological structures, such as glands in prostate cancer. Yet, digital pathology pipelines often rely on grid‑based tiling that ignores tissue architecture. This introduces irrelevant information and limits interpretability. We introduce histology‑informed tiling (HIT), which uses semantic segmentation to extract glands from whole slide images (WSIs) as biologically meaningful input patches for multiple‑instance learning (MIL) and phenotyping. Trained on 137 samples from the ProMPT cohort, HIT achieved a gland‑level Dice score of 0.83 ± 0.17. By extracting 380,000 glands from 760 WSIs across the ICGC‑C and TCGA‑PRAD cohorts, HIT improved the AUCs of MIL models by 10% for detecting copy number variations (CNVs) in genes related to epithelial‑mesenchymal transition (EMT) and MYC, and revealed 15 gland clusters, several of which were associated with cancer relapse, oncogenic mutations, and high Gleason. Therefore, HIT improved the accuracy and interpretability of MIL predictions, while streamlining computations by focussing on biologically meaningful structures during feature extraction.

Abstract:
Recent advances in 3D Gaussian Splatting (3DGS) have achieved state‑of‑the‑art results for novel view synthesis. However, efficiently capturing high‑fidelity reconstructions of specific objects within complex scenes remains a significant challenge. A key limitation of existing active reconstruction methods is their reliance on scene‑level uncertainty metrics, which are often biased by irrelevant background clutter and lead to inefficient view selection for object‑centric tasks. We present OUGS, a novel framework that addresses this challenge with a more principled, physically‑grounded uncertainty formulation for 3DGS. Our core innovation is to derive uncertainty directly from the explicit physical parameters of the 3D Gaussian primitives (e.g., position, scale, rotation). By propagating the covariance of these parameters through the rendering Jacobian, we establish a highly interpretable uncertainty model. This foundation allows us to then seamlessly integrate semantic segmentation masks to produce a targeted, object‑aware uncertainty score that effectively disentangles the object from its environment. This allows for a more effective active view selection strategy that prioritizes views critical to improving object fidelity. Experimental evaluations on public datasets demonstrate that our approach significantly improves the efficiency of the 3DGS reconstruction process and achieves higher quality for targeted objects compared to existing state‑of‑the‑art methods, while also serving as a robust uncertainty estimator for the global scene.

Abstract:
While specialized AI models excel at isolated video tasks like generation or understanding, real‑world applications demand complex, iterative workflows that combine these capabilities. To bridge this gap, we introduce UniVA, an open‑source, omni‑capable multi‑agent framework for next‑generation video generalists that unifies video understanding, segmentation, editing, and generation into cohesive workflows. UniVA employs a Plan‑and‑Act dual‑agent architecture that drives a highly automated and proactive workflow: a planner agent interprets user intentions and decomposes them into structured video‑processing steps, while executor agents execute these through modular, MCP‑based tool servers (for analysis, generation, editing, tracking, etc.). Through a hierarchical multi‑level memory (global knowledge, task context, and user‑specific preferences), UniVA sustains long‑horizon reasoning, contextual continuity, and inter‑agent communication, enabling interactive and self‑reflective video creation with full traceability. This design enables iterative and any‑conditioned video workflows (e.g., text/image/video‑conditioned generation $\rightarrow$ multi‑round editing $\rightarrow$ object segmentation $\rightarrow$ compositional synthesis) that were previously cumbersome to achieve with single‑purpose models or monolithic video‑language models. We also introduce UniVA‑Bench, a benchmark suite of multi‑step video tasks spanning understanding, editing, segmentation, and generation, to rigorously evaluate such agentic video systems. Both UniVA and UniVA‑Bench are fully open‑sourced, aiming to catalyze research on interactive, agentic, and general‑purpose video intelligence for the next generation of multimodal AI systems. (https://univa.online/)

Abstract:
Despite recent advances in Open‑Vocabulary Semantic Segmentation (OVSS), existing training‑free methods face several limitations: computationally expensive affinity refinement strategies, ineffective fusion of transformer attention maps due to equal weighting, and reliance on fixed‑size Gaussian kernels to reinforce local spatial smoothness, which enforces isotropic neighborhoods. We propose a strong baseline for training‑free OVSS termed NERVE (Neighbourhood & Entropy‑guided Random‑walk for open‑Vocabulary sEgmentation), which uniquely integrates global and fine‑grained local information, exploiting the neighbourhood structure from the self‑attention layer of a stable diffusion model. We also introduce a stochastic random walk for refining the affinity rather than relying on fixed‑size Gaussian kernels for local context. This spatial diffusion process encourages propagation across connected and semantically related areas, enabling it to effectively delineate objects with arbitrary shapes. Whereas most existing approaches treat self‑attention maps from different transformer heads or layers equally, our method uses entropy‑based uncertainty to select the most relevant maps. Notably, our method does not require any conventional post‑processing techniques like Conditional Random Fields (CRF) or Pixel‑Adaptive Mask Refinement (PAMR). Experiments on 7 popular semantic segmentation benchmarks yield overall state‑of‑the‑art zero‑shot segmentation performance, providing an effective approach to open‑vocabulary semantic segmentation.
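
The idea of ranking attention maps by entropy rather than weighting them equally can be sketched as follows. The abstract does not state the exact selection rule, so everything here is an assumption made for illustration: maps are flat lists of non‑negative weights, uncertainty is the Shannon entropy of the normalised map, and the `k` lowest‑entropy (most sharply peaked) maps are kept. The names `shannon_entropy` and `select_maps` are hypothetical.

```python
import math


def shannon_entropy(attn_map):
    """Shannon entropy (in nats) of a non-negative map after normalising it to sum to 1."""
    total = sum(attn_map)
    if total <= 0:
        raise ValueError("attention map must contain positive mass")
    probs = [value / total for value in attn_map if value > 0]
    return -sum(p * math.log(p) for p in probs)


def select_maps(attn_maps, k):
    """Indices of the k lowest-entropy maps (an assumed rule, not the paper's)."""
    order = sorted(range(len(attn_maps)), key=lambda i: shannon_entropy(attn_maps[i]))
    return order[:k]


# Three hypothetical attention maps over four spatial positions.
maps = [
    [0.25, 0.25, 0.25, 0.25],  # uniform: maximal entropy, least informative
    [0.97, 0.01, 0.01, 0.01],  # sharply peaked: low entropy
    [0.60, 0.20, 0.10, 0.10],  # in between
]
chosen = select_maps(maps, k=2)
```

With this rule the uniform map is discarded, since a map that attends equally everywhere carries no localisation signal.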

Abstract:
We present Glioma C6, a new open dataset for instance segmentation of glioma C6 cells, designed as both a benchmark and a training resource for deep learning models. The dataset comprises 75 high‑resolution phase‑contrast microscopy images with over 12,000 annotated cells, providing a realistic testbed for biomedical image analysis. It includes soma annotations and morphological cell categorization provided by biologists. Additional categorization of cells, based on morphology, aims to enhance the utilization of image data for cancer cell research. Glioma C6 consists of two parts: the first is curated with controlled parameters for benchmarking, while the second supports generalization testing under varying conditions. We evaluate the performance of several generalist segmentation models, highlighting their limitations on our dataset. Our experiments demonstrate that training on Glioma C6 significantly enhances segmentation performance, reinforcing its value for developing robust and generalizable models. The dataset is publicly available for researchers.

Abstract:
Three‑dimensional feature extraction is a critical component of autonomous driving systems, where perception tasks such as 3D object detection, bird's‑eye‑view (BEV) semantic segmentation, and occupancy prediction serve as important constraints on 3D features. While large image encoders, high‑resolution images, and long‑term temporal inputs can significantly enhance feature quality and deliver remarkable performance gains, these techniques are often incompatible in both training and inference due to computational resource constraints. Moreover, different tasks favor distinct feature representations, making it difficult for a single model to perform end‑to‑end inference across multiple tasks while maintaining accuracy comparable to that of single‑task models. To alleviate these issues, we present the HENet and HENet++ framework for multi‑task 3D perception and end‑to‑end autonomous driving. Specifically, we propose a hybrid image encoding network that uses a large image encoder for short‑term frames and a small one for long‑term frames. Furthermore, our framework simultaneously extracts both dense and sparse features, providing more suitable representations for different tasks, reducing cumulative errors, and delivering more comprehensive information to the planning module. The proposed architecture maintains compatibility with various existing 3D feature extraction methods and supports multimodal inputs. HENet++ achieves state‑of‑the‑art end‑to‑end multi‑task 3D perception results on the nuScenes benchmark, while also attaining the lowest collision rate on the nuScenes end‑to‑end autonomous driving benchmark.

Abstract:
Spatial semantic segmentation of sound scenes (S5) consists of jointly performing audio source separation and sound event classification from a multichannel audio mixture. Evaluating S5 systems with separation and classification metrics individually makes system comparison difficult, whereas existing joint metrics, such as the class‑aware signal‑to‑distortion ratio (CA‑SDR), can conflate separation and labeling errors. In particular, CA‑SDR relies on predicted class labels for source matching, which may obscure label swaps or misclassifications when the underlying source estimates remain perceptually correct. In this work, we introduce the class and source‑aware signal‑to‑distortion ratio (CASA‑SDR), a new metric that performs permutation‑invariant source matching before computing classification errors, thereby shifting from a classification‑focused approach to a separation‑focused approach. We first analyze CA‑SDR in controlled scenarios with oracle separation and synthetic classification errors, as well as under controlled cross‑contamination between sources, and compare its behavior to that of the classical SDR and CASA‑SDR. We also study the impact of classification errors on the metrics by introducing error‑based and source‑based aggregation strategies. Finally, we compare CA‑SDR and CASA‑SDR on systems submitted to Task 4 of the DCASE 2025 challenge, highlighting the cases where CA‑SDR over‑penalizes label swaps or poorly separated sources, while CASA‑SDR provides a more interpretable separation‑centric assessment of S5 performance.
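
The difference between matching sources by predicted label and matching them by signal similarity can be illustrated with a toy example. The sketch below computes the classical SDR, finds the permutation of estimates that maximises the summed SDR, and only then counts label errors. It illustrates permutation‑invariant matching in general and is not the CASA‑SDR definition from the paper; the function names, the toy signals, and the class names are all hypothetical.

```python
import math
from itertools import permutations


def sdr_db(ref, est):
    """Classical SDR in dB: 10 * log10(||ref||^2 / ||ref - est||^2)."""
    signal = sum(r * r for r in ref)
    noise = sum((r - e) ** 2 for r, e in zip(ref, est))
    if noise == 0:
        return float("inf")
    return 10.0 * math.log10(signal / noise)


def best_permutation(refs, ests):
    """Assign estimates to references so that the summed SDR is maximal."""
    n = len(refs)
    return max(
        permutations(range(n)),
        key=lambda perm: sum(sdr_db(refs[i], ests[perm[i]]) for i in range(n)),
    )


# Two toy reference sources with their true classes.
refs = [[1.0, 0.0, -1.0, 0.0], [0.0, 2.0, 0.0, -2.0]]
ref_labels = ["speech", "dog"]

# Estimates are well separated but arrive in swapped order with swapped labels.
ests = [[0.0, 1.9, 0.1, -2.0], [0.9, 0.0, -1.0, 0.1]]
est_labels = ["speech", "dog"]

perm = best_permutation(refs, ests)
label_errors = sum(ref_labels[i] != est_labels[perm[i]] for i in range(len(refs)))
matched_sdrs = [sdr_db(refs[i], ests[perm[i]]) for i in range(len(refs))]
```

In this toy case the signal‑based matching pairs each reference with the estimate that actually resembles it, so separation quality stays high while both labels are counted as wrong. Pairing by predicted label instead would compare each reference against the wrong waveform and report a poor SDR for sources that are in fact well separated.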

Abstract:
The deployment of autonomous service robots in human‑centric environments is hindered by a critical gap in perception and planning. Traditional navigation systems rely on expensive LiDARs that, while geometrically precise, are semantically unaware: they cannot distinguish an important document on an office floor from a harmless piece of litter, treating both as physically traversable. While advanced semantic segmentation exists, no prior work has successfully integrated this visual intelligence into a real‑time path planner that is efficient enough for low‑cost, embedded hardware. This paper presents a framework to bridge this gap, delivering context‑aware navigation on an affordable robotic platform. Our approach centers on a novel, tight integration of a lightweight perception module with an online A* planner. The perception system employs a semantic segmentation model to identify user‑defined visual constraints, enabling the robot to navigate based on contextual importance rather than physical size alone. This adaptability allows an operator to define what is critical for a given task, be it sensitive papers in an office or safety lines in a factory, thus resolving the ambiguity of what to avoid. This semantic perception is seamlessly fused with geometric data. The identified visual constraints are projected as non‑geometric obstacles onto a global map that is continuously updated from sensor data, enabling robust navigation through both partially known and unknown environments. We validate our framework through extensive experiments in high‑fidelity simulations and on a real‑world robotic platform. The results demonstrate robust, real‑time performance, proving that a cost‑effective robot can safely navigate complex environments while respecting critical visual cues invisible to traditional planners.
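
The effect of projecting visual constraints onto the map as non‑geometric obstacles can be sketched with a small grid search. The code below is a textbook A* search on a 4‑connected grid with a Manhattan‑distance heuristic, run once on a fixed map; it is not the authors' online planner, and the grid size, cell coordinates, and the function name `astar` are hypothetical.

```python
import heapq


def astar(grid_size, blocked, start, goal):
    """A* on a 4-connected grid with unit step cost; returns a list of cells or None."""
    rows, cols = grid_size

    def heuristic(cell):
        # Manhattan distance is admissible for 4-connected unit-cost moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(heuristic(start), 0, start)]
    came_from = {}
    best_cost = {start: 0}
    while open_heap:
        _, cost, cell = heapq.heappop(open_heap)
        if cell == goal:
            path = [cell]
            while cell in came_from:
                cell = came_from[cell]
                path.append(cell)
            return path[::-1]
        if cost > best_cost.get(cell, float("inf")):
            continue  # stale heap entry, a cheaper route to this cell was found
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nxt[0] < rows and 0 <= nxt[1] < cols
            if not inside or nxt in blocked:
                continue
            new_cost = cost + 1
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                came_from[nxt] = cell
                heapq.heappush(open_heap, (new_cost + heuristic(nxt), new_cost, nxt))
    return None


geometric_obstacles = {(1, 1)}           # e.g. a cell occupied by furniture
semantic_constraints = {(0, 2), (1, 2)}  # e.g. cells covered by a document on the floor

path_geometry_only = astar((4, 4), geometric_obstacles, start=(0, 0), goal=(0, 3))
path_with_semantics = astar(
    (4, 4), geometric_obstacles | semantic_constraints, start=(0, 0), goal=(0, 3)
)
```

With geometry alone the planner drives straight across the cells holding the document; once those cells are marked as blocked, it detours around them at the cost of a longer path.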

Abstract:
Robotic‑ and computer‑assisted minimally invasive surgery (RAMIS) is increasingly relying on computer vision methods for reliable instrument recognition and surgical workflow understanding. Developing such systems often requires large, well‑annotated datasets, but existing resources often address isolated tasks, neglect temporal dependencies, or lack multi‑center variability. We present the Surgical Procedure Phase, Keypoint, and Instrument Recognition (PhaKIR) dataset, comprising eight complete laparoscopic cholecystectomy videos recorded at three medical centers. The dataset provides frame‑level annotations for three interconnected tasks: surgical phase recognition (485,875 frames), instrument keypoint estimation (19,435 frames), and instrument instance segmentation (19,435 frames). PhaKIR is, to our knowledge, the first multi‑institutional dataset to jointly provide phase labels, instrument pose information, and pixel‑accurate instrument segmentations, while also enabling the exploitation of temporal context since full surgical procedure sequences are available. It served as the basis for the PhaKIR Challenge as part of the Endoscopic Vision (EndoVis) Challenge at MICCAI 2024 to benchmark methods in surgical scene understanding, thereby further validating the dataset's quality and relevance. The dataset is publicly available upon request via the Zenodo platform.

Abstract:
Rapid post‑earthquake damage assessment is crucial for rescue and resource planning. Still, existing remote sensing methods depend on costly aerial images and expert labeling, and produce only binary damage maps for early‑stage evaluation. Although ground‑level images from social networks provide a valuable source to fill this gap, a large pixel‑level annotated dataset for this task is still unavailable. We introduce EIDSeg, the first large‑scale semantic segmentation dataset specifically for post‑earthquake social media imagery. The dataset comprises 3,266 images from nine major earthquakes (2008‑2023), annotated across five classes of infrastructure damage: Undamaged Building, Damaged Building, Destroyed Building, Undamaged Road, and Damaged Road. We propose a practical three‑phase cross‑disciplinary annotation protocol with labeling guidelines that enables consistent segmentation by non‑expert annotators, achieving over 70% inter‑annotator agreement. We benchmark several state‑of‑the‑art segmentation models, identifying Encoder‑only Mask Transformer (EoMT) as the top‑performing method with a Mean Intersection over Union (mIoU) of 80.8%. By unlocking social networks' rich ground‑level perspective, our work paves the way for a faster, finer‑grained damage assessment in the post‑earthquake scenario.

Abstract:
Detailed structural and species information at the individual‑tree level is increasingly important to support precision forestry and biodiversity conservation, and to provide reference data for biomass and carbon mapping. Point clouds from airborne and ground‑based laser scanning are currently the most suitable data source to rapidly derive such information at scale. Recent advancements in deep learning have improved the segmentation and classification of individual trees and the identification of semantic tree components. However, deep learning models typically require large amounts of annotated training data, which limits further improvement. Producing dense, high‑quality annotations for 3D point clouds, especially in complex forests, is labor‑intensive and challenging to scale. We explore strategies to reduce dependence on large annotated datasets using self‑supervised and transfer learning architectures. Our objective is to improve performance across three tasks: instance segmentation, semantic segmentation, and tree classification using realistic and operational training sets. Our findings indicate that combining self‑supervised learning with domain adaptation significantly enhances instance segmentation compared to training from scratch (AP50 +16.98%), self‑supervised learning suffices for semantic segmentation (mIoU +1.79%), and hierarchical transfer learning enables accurate classification of unseen species (Jaccard +6.07%). To simplify use and encourage uptake, we integrated the tasks into a unified framework, streamlining the process from raw point clouds to tree delineation, structural analysis, and species classification. Pretrained models reduce energy consumption and carbon emissions by ~21%. This open‑source contribution aims to accelerate operational extraction of individual tree information from laser scanning point clouds to support forestry, biodiversity, and carbon mapping.

Abstract:
The perception of high‑definition maps is an integral component of environmental perception in autonomous driving systems. Existing research has often focused on online construction of high‑definition maps. For instance, the MapTR[9] series employs a detection‑based method to output vectorized map instances in parallel in an end‑to‑end manner. However, despite their capability for real‑time construction, detection‑based methods are observed to lack robust generalizability[19], which hampers their applicability in auto‑labeling systems. Therefore, aiming to improve generalizability, we reinterpret road elements as rasterized polygons and design a concise framework based on instance segmentation. First, a segmentation‑based transformer delivers instance masks in an end‑to‑end manner; a Potrace‑based[17] post‑processing module then yields vectorized map elements. Quantitative results on the nuScenes[1] dataset substantiate the effectiveness and generalizability of our method.

Abstract:
Chain‑of‑Thought (CoT) reasoning enhances the problem‑solving ability of large language models (LLMs) but leads to substantial inference overhead, limiting deployment in resource‑constrained settings. This paper investigates efficient CoT transfer across models of different scales and architectures through an adaptive reasoning summarization framework. The proposed method compresses reasoning traces via semantic segmentation with importance scoring, budget‑aware dynamic compression, and coherence reconstruction, preserving critical reasoning steps while significantly reducing token usage. Experiments on 7,501 medical examination questions across 10 specialties show up to 40% higher accuracy than truncation under the same token budgets. Evaluations on 64 model pairs from eight LLMs (1.5B‑32B parameters, including DeepSeek‑R1 and Qwen3) confirm strong cross‑model transferability. Furthermore, a Gaussian Process‑based Bayesian optimization module reduces evaluation cost by 84% and reveals a power‑law relationship between model size and cross‑domain robustness. These results demonstrate that reasoning summarization provides a practical path toward efficient CoT transfer, enabling advanced reasoning under tight computational constraints. Code will be released upon publication.
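
The budget-aware step can be pictured as a selection problem: keep the most important reasoning segments that still fit a token budget, without reordering them. The sketch below is only an illustration under stated assumptions: the importance scores are supplied by hand, token cost is approximated by whitespace word count, and `compress_trace` is a hypothetical name. The paper's actual segmentation, scoring, and coherence reconstruction are not described in the abstract.

```python
def compress_trace(segments, scores, budget):
    """Keep the highest-scoring segments that fit the budget, in original order."""
    # Token cost is approximated by whitespace word count (an assumption).
    costs = [len(s.split()) for s in segments]
    ranked = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    kept, used = set(), 0
    for i in ranked:
        if used + costs[i] <= budget:
            kept.add(i)
            used += costs[i]
    # Restore the original order so the shortened trace still reads as a chain.
    return [segments[i] for i in sorted(kept)]

segments = [
    "Patient has fever and cough.",
    "Let me restate the question.",
    "Chest X-ray shows consolidation.",
    "So the diagnosis is pneumonia.",
]
scores = [0.7, 0.1, 0.8, 0.9]  # hand-picked importance scores
summary = compress_trace(segments, scores, budget=10)
```

Plain truncation to the same budget would keep the first two segments and drop the conclusion, which is the failure mode a score-driven selection is meant to avoid.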

Abstract:
Ultrasound (US) video segmentation remains a challenging problem due to strong inter‑ and intra‑dataset variability, motion artifacts, and limited annotated data. Although foundation models such as Segment Anything Model 2 (SAM2) demonstrate strong zero‑shot and prompt‑guided segmentation capabilities, their performance deteriorates substantially when transferred to medical imaging domains. Current adaptation studies mainly emphasize architectural modifications, while the influence of data characteristics and training regimes has not been systematically examined. In this study, we present a comprehensive, data‑centric investigation of SAM2 adaptation for ultrasound video segmentation. We analyze how training‑set size, video duration, and augmentation schemes affect adaptation performance under three paradigms: task‑specific fine‑tuning, intermediate adaptation, and multi‑task joint training, across five SAM2 variants and multiple prompting modes. We further design six ultrasound‑specific augmentations, assessing their effect relative to generic strategies. Experiments on three representative ultrasound datasets reveal that data scale and temporal context play a more decisive role than model architecture or initialization. Moreover, joint training offers an efficient compromise between modality alignment and task specialization. This work aims to provide empirical insights for developing efficient, data‑aware adaptation pipelines for SAM2 in ultrasound video analysis.

Abstract:
Recent advances in 3D point cloud transformers have led to state‑of‑the‑art results in tasks such as semantic segmentation and reconstruction. However, these models typically rely on dense token representations, incurring high computational and memory costs during training and inference. In this work, we present the finding that tokens are remarkably redundant, leading to substantial inefficiency. We introduce gitmerge3D, a globally informed graph token merging method that can reduce the token count by up to 90‑95% while maintaining competitive performance. This finding challenges the prevailing assumption that more tokens inherently yield better performance and highlights that many current models are over‑tokenized and under‑optimized for scalability. We validate our method across multiple 3D vision tasks and show consistent improvements in computational efficiency. This work is the first to assess redundancy in large‑scale 3D transformer models, providing insights into the development of more efficient 3D foundation architectures. Our code and checkpoints are publicly available at https://gitmerge3d.github.io

Abstract:
Novel view synthesis from monocular videos of dynamic scenes with unknown camera poses remains a fundamental challenge in computer vision and graphics. While recent advances in 3D representations such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have shown promising results for static scenes, they struggle with dynamic content and typically rely on pre‑computed camera poses. We present 4D3R, a pose‑free dynamic neural rendering framework that decouples static and dynamic components through a two‑stage approach. Our method first leverages 3D foundational models for initial pose and geometry estimation, followed by motion‑aware refinement. 4D3R introduces two key technical innovations: (1) a motion‑aware bundle adjustment (MA‑BA) module that combines transformer‑based learned priors with SAM2 for robust dynamic object segmentation, enabling more accurate camera pose refinement; and (2) an efficient Motion‑Aware Gaussian Splatting (MA‑GS) representation that uses control points with a deformation field MLP and linear blend skinning to model dynamic motion, significantly reducing computational cost while maintaining high‑quality reconstruction. Extensive experiments on real‑world dynamic datasets demonstrate that our approach achieves up to 1.8dB PSNR improvement over state‑of‑the‑art methods, particularly in challenging scenarios with large dynamic objects, while reducing computational requirements by 5x compared to previous dynamic scene representations.
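
Linear blend skinning, mentioned above as part of MA‑GS, deforms each point as a weighted sum of rigid transforms attached to control points. The sketch below shows only that standard formula. In 4D3R the transforms and weights presumably come from the deformation field MLP, whereas here they are fixed by hand and all names are illustrative.

```python
import numpy as np

def linear_blend_skinning(points, weights, rotations, translations):
    """Deform points as a weighted sum of per-control-point rigid transforms.

    points: (N, 3), weights: (N, K) with rows summing to 1,
    rotations: (K, 3, 3), translations: (K, 3).
    """
    # Apply every transform to every point: result has shape (N, K, 3).
    transformed = np.einsum("kij,nj->nki", rotations, points) + translations[None]
    # Blend the K candidates for each point with its skinning weights.
    return np.einsum("nk,nki->ni", weights, transformed)

points = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
weights = np.array([[1.0, 0.0], [0.5, 0.5]])
rotations = np.stack([np.eye(3), np.eye(3)])
translations = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
deformed = linear_blend_skinning(points, weights, rotations, translations)
```

The first point follows only the static control point and stays put; the second is split evenly between the two and moves half of the second control point's translation.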

Abstract:
Nuclei segmentation is the cornerstone task in histology image reading, shedding light on the underlying molecular patterns and leading to disease or cancer diagnosis. Yet, it is a laborious task that requires expertise from trained physicians. The large nuclei variability across different organ tissues and acquisition processes challenges the automation of this task. On the other hand, data annotations are expensive to obtain, and thus, Deep Learning (DL) models are challenged to generalize to unseen organs or different domains. This work proposes Local‑to‑Global NuSegHop (LG‑NuSegHop), a self‑supervised pipeline developed on prior knowledge of the problem and molecular biology. There are three distinct modules: (1) a set of local processing operations to generate a pseudolabel, (2) NuSegHop a novel data‑driven feature extraction model and (3) a set of global operations to post‑process the predictions of NuSegHop. Notably, even though the proposed pipeline uses no manually annotated training data or domain adaptation, it maintains a good generalization performance on other datasets. Experiments in three publicly available datasets show that our method outperforms other self‑supervised and weakly supervised methods while having a competitive standing among fully supervised methods. Remarkably, every module within LG‑NuSegHop is transparent and explainable to physicians.

Abstract:
Short‑video platforms have become a central medium in the modern Internet landscape, where efficient information delivery and strong interactivity are reshaping user engagement and cultural dissemination. Among the various forms of user interaction, comments play a vital role in fostering community participation and enabling content re‑creation. However, generating comments that are both compliant with platform guidelines and capable of exhibiting stylistic diversity and contextual awareness remains a significant challenge. We introduce LOLGORITHM, a modular multi‑agent system (MAS) designed for controllable short‑video comment generation. The system integrates video segmentation, contextual and affective analysis, and style‑aware prompt construction. It supports six distinct comment styles: puns (homophones), rhyming, meme application, sarcasm (irony), plain humor, and content extraction. Powered by a multimodal large language model (MLLM), LOLGORITHM directly processes video inputs and achieves fine‑grained style control through explicit prompt markers and few‑shot examples. To support development and evaluation, we construct a bilingual dataset using official APIs from Douyin (Chinese) and YouTube (English), covering five popular video genres: comedy skits, daily life jokes, funny animal clips, humorous commentary, and talk shows. Evaluation combines automated metrics (originality, relevance, and style conformity) with a large‑scale human preference study involving 40 videos and 105 participants. Results show that LOLGORITHM significantly outperforms baseline models, achieving preference rates of over 90% on Douyin and 87.55% on YouTube. This work presents a scalable and culturally adaptive framework for stylized comment generation on short‑video platforms, offering a promising path to enhance user engagement and creative interaction.

Abstract:
Deep learning semantic segmentation methods have shown promising performance for very high 1‑m resolution land cover classification, but the challenge of collecting large volumes of representative training data creates a significant barrier to widespread adoption of such models for meter‑scale land cover mapping over large areas. In this study, we present a novel label‑efficient approach for statewide 1‑m land cover classification using only 1,000 annotated reference image patches with self‑supervised deep learning. We use the "Bootstrap Your Own Latent" pre‑training strategy with a large amount of unlabeled color‑infrared aerial images (377,921 patches of 256x256 pixels at 1‑m resolution) to pre‑train a ResNet‑101 convolutional encoder. The learned encoder weights were subsequently transferred into multiple deep semantic segmentation architectures (FCN, U‑Net, Attention U‑Net, DeepLabV3+, UPerNet, PAN), which were then fine‑tuned using very small training dataset sizes with cross‑validation (250, 500, 750 patches). Among the fine‑tuned models, we obtained 87.14% overall accuracy and 75.58% macro F1 score using an ensemble of the best‑performing U‑Net models for comprehensive 1‑m, 8‑class land cover mapping, covering more than 123 billion pixels over the state of Mississippi, USA. Detailed qualitative and quantitative analysis revealed accurate mapping of open water and forested areas, while highlighting challenges in accurate delineation between cropland, herbaceous, and barren land cover types. These results show that self‑supervised learning is an effective strategy for reducing the need for large volumes of manually annotated data, directly addressing a major limitation to high spatial resolution land cover mapping at scale.
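
For readers unfamiliar with "Bootstrap Your Own Latent", its two core ingredients are a regression loss between normalized online predictions and target projections, and a target network updated as an exponential moving average of the online network. The sketch below illustrates both with NumPy arrays standing in for network outputs and parameters. The momentum value and the dictionary layout are assumptions, not settings reported in this study.

```python
import numpy as np

def byol_loss(online_prediction, target_projection):
    """BYOL regression loss: 2 - 2 * cosine similarity, averaged over the batch."""
    p = online_prediction / np.linalg.norm(online_prediction, axis=1, keepdims=True)
    z = target_projection / np.linalg.norm(target_projection, axis=1, keepdims=True)
    return float(np.mean(2.0 - 2.0 * np.sum(p * z, axis=1)))

def ema_update(target_params, online_params, tau=0.99):
    """Move each target parameter a small step toward its online counterpart."""
    return {
        name: tau * target_params[name] + (1.0 - tau) * online_params[name]
        for name in target_params
    }

online = {"w": np.ones(3)}
target = {"w": np.zeros(3)}
target = ema_update(target, online, tau=0.9)
```

No labels appear anywhere in this objective, which is why the 377,921 unlabeled patches can be used for encoder pre-training before the small annotated set is brought in.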

Abstract:
For developing safe Autonomous Driving Systems (ADS), rigorous testing is required before they are deemed safe for road deployments. Since comprehensive conventional physical testing is impractical due to cost and safety concerns, Virtual Testing Environments (VTE) can be adopted as an alternative. Comparing VTE‑generated sensor outputs against their real‑world analogues can be a strong indication that the VTE accurately represents reality. Correspondingly, this work explores a comprehensive experimental approach to finding evaluation metrics suitable for comparing real‑world and simulated LiDAR scans. The metrics were tested in terms of sensitivity and accuracy with different noise, density, distortion, sensor orientation, and channel settings. From comparing the metrics, we found that Density Aware Chamfer Distance (DCD) works best across all cases. In the second step of the research, a Virtual Testing Environment was generated using real LiDAR scan data. The data was collected in a controlled environment with only static objects using an instrumented vehicle equipped with LiDAR, IMU and cameras. Simulated LiDAR scans were generated from the VTEs using the same pose as real LiDAR scans. The simulated and real LiDAR scans were compared in terms of model perception and geometric similarity. Actual and simulated LiDAR scans have a similar semantic segmentation output with an mIoU of 21% with corrected intensity and an average density aware chamfer distance (DCD) of 0.63. This indicates a slight difference in the geometric properties of simulated and real LiDAR scans and a significant difference between model outputs. During the comparison, density‑aware chamfer distance was found to be the most correlated among the metrics with perception methods.
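
To make the DCD figure easier to interpret, here is a simplified sketch in the spirit of a density-aware Chamfer distance. Each point's nearest-neighbour term is bounded in [0, 1] through an exponential and is discounted when many points share the same neighbour. The use of squared distances, the value of `alpha`, and the brute-force search are assumptions of this sketch, so it should not be read as the exact metric used in the study.

```python
import numpy as np

def density_aware_chamfer(a, b, alpha=1.0):
    """Simplified density-aware Chamfer distance between point sets a (N,3) and b (M,3)."""
    # Pairwise squared Euclidean distances, shape (N, M).
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)

    def one_side(dist):
        nearest = dist.argmin(axis=1)  # nearest-neighbour index for each query point
        counts = np.bincount(nearest, minlength=dist.shape[1])  # hits per target point
        nearest_d2 = dist[np.arange(dist.shape[0]), nearest]
        # Targets hit by many queries contribute less, penalising density mismatch.
        return np.mean(1.0 - np.exp(-alpha * nearest_d2) / counts[nearest])

    return 0.5 * (one_side(d2) + one_side(d2.T))

rng = np.random.default_rng(0)
real = rng.normal(size=(50, 3))
sim = real + 0.1 * rng.normal(size=(50, 3))  # stand-in for a simulated scan
dcd_same = density_aware_chamfer(real, real)
dcd_diff = density_aware_chamfer(real, sim)
```

Because every term lies between 0 and 1, the result is bounded as well: 0 means the two scans coincide, and values approaching 1 mean they share little geometric structure.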

Abstract:
Precise semantic segmentation of crops and weeds is necessary for agricultural weeding robots. However, training deep learning models requires large annotated datasets, which are costly to obtain in real fields. Synthetic data can reduce this burden, but the gap between simulated and real images remains a challenge. In this paper, we present a pipeline for procedural generation of synthetic crop‑weed images using Blender, producing annotated datasets under diverse conditions of plant growth, weed density, lighting, and camera angle. We benchmark several state‑of‑the‑art segmentation models on synthetic and real datasets and analyze their cross‑domain generalization. Our results show that training on synthetic images leads to a sim‑to‑real gap of 10%, surpassing previous state‑of‑the‑art methods. Moreover, synthetic data demonstrates good generalization properties, outperforming real datasets in cross‑domain scenarios. These findings highlight the potential of synthetic agricultural datasets and support hybrid strategies for more efficient model training.

Abstract:
Moving object segmentation is a crucial task for safe and reliable autonomous mobile systems like self‑driving cars, improving the reliability and robustness of subsequent tasks like SLAM or path planning. While the segmentation of camera or LiDAR data is widely researched and achieves great results, it often introduces an increased latency by requiring the accumulation of temporal sequences to gain the necessary temporal context. Radar sensors overcome this problem with their ability to provide a direct measurement of a point's Doppler velocity, which can be exploited for single‑scan moving object segmentation. However, radar point clouds are often sparse and noisy, making data annotation for use in supervised learning very tedious, time‑consuming, and cost‑intensive. To overcome this problem, we address the task of self‑supervised moving object segmentation of sparse and noisy radar point clouds. We follow a two‑step approach of contrastive self‑supervised representation learning with subsequent supervised fine‑tuning using limited amounts of annotated data. We propose a novel clustering‑based contrastive loss function with cluster refinement based on dynamic points removal to pretrain the network to produce motion‑aware representations of the radar data. Our method improves label efficiency after fine‑tuning, effectively boosting state‑of‑the‑art performance by self‑supervised pretraining.

Abstract:
Planktonic foraminifera, marine protists characterized by their intricate chambered shells, serve as valuable indicators of past and present environmental conditions. Understanding their chamber growth trajectory provides crucial insights into organismal development and ecological adaptation under changing environments. However, automated tracing of chamber growth from imaging data remains largely unexplored, with existing approaches relying heavily on manual segmentation of each chamber, which is time‑consuming and subjective. In this study, we propose an end‑to‑end pipeline that integrates instance segmentation, a computer vision technique not extensively explored in foraminifera, with a dedicated chamber ordering algorithm to automatically reconstruct three‑dimensional growth trajectories from high‑resolution computed tomography scans. We quantitatively and qualitatively evaluate multiple instance segmentation methods, each optimized for distinct spatial features of the chambers, and examine their downstream influence on growth‑order reconstruction accuracy. Experimental results on expert‑annotated datasets demonstrate that the proposed pipeline substantially reduces manual effort while maintaining biologically meaningful accuracy. Although segmentation models exhibit under‑segmentation in smaller chambers due to reduced voxel fidelity and subtle inter‑chamber connectivity, the chamber‑ordering algorithm remains robust, achieving consistent reconstruction of developmental trajectories even under partial segmentation. This work provides the first fully automated and reproducible pipeline for digital foraminiferal growth analysis, establishing a foundation for large‑scale, data‑driven ecological studies.

Abstract:
Off‑road semantic segmentation suffers from thick, inconsistent boundaries, sparse supervision for rare classes, and pervasive label noise. Designs that fuse only at low resolution blur edges and propagate local errors, whereas maintaining high‑resolution pathways or repeating high‑resolution fusions is costly and fragile to noise. We introduce a resolution‑aware token decoder that balances global semantics, local consistency, and boundary fidelity under imperfect supervision. Most computation occurs at a low‑resolution bottleneck; a gated cross‑attention injects fine‑scale detail, and only a sparse, uncertainty‑selected set of pixels is refined. The components are co‑designed and tightly integrated: global self‑attention with lightweight dilated depthwise refinement restores local coherence; a gated cross‑attention integrates fine‑scale features from a standard high‑resolution encoder stream without amplifying noise; and a class‑aware point refinement corrects residual ambiguities with negligible overhead. During training, we add a boundary‑band consistency regularizer that encourages coherent predictions in a thin neighborhood around annotated edges, with no inference‑time cost. Overall, the results indicate competitive performance and improved stability across transitions.
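
The idea of refining only a sparse, uncertainty-selected set of pixels can be sketched as follows. The abstract does not say how uncertainty is measured, so this example assumes a common choice, the margin between the two most probable classes, and `select_uncertain_pixels` is a hypothetical helper rather than the authors' code.

```python
import numpy as np

def select_uncertain_pixels(probs, k):
    """Return (row, col) indices of the k pixels with the smallest top-2 margin.

    probs: (C, H, W) array of per-class probabilities.
    """
    top2 = np.sort(probs, axis=0)[-2:]        # two largest probabilities per pixel
    margin = top2[1] - top2[0]                # small margin means an ambiguous pixel
    flat = np.argsort(margin, axis=None)[:k]  # indices of the k most ambiguous pixels
    return np.stack(np.unravel_index(flat, margin.shape), axis=1)

# Two classes on a 2x2 map; only pixel (0, 1) is a coin flip.
probs = np.array([
    [[0.9, 0.5], [0.9, 0.9]],
    [[0.1, 0.5], [0.1, 0.1]],
])
idx = select_uncertain_pixels(probs, k=1)
```

Only the returned locations would then be passed to the point refinement head, which keeps the extra cost proportional to `k` rather than to the image size.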

Abstract:
Accurate segmentation of medical images is fundamental to tumor diagnosis and treatment planning. SAM‑based interactive segmentation has gained attention for its strong generalization, but most methods follow a single‑point‑to‑single‑object paradigm, which limits multi‑lesion segmentation. Moreover, ViT backbones capture global context but often miss high‑fidelity local details. We propose MIQ‑SAM3D, a multi‑instance 3D segmentation framework with a competitive query optimization strategy that shifts from single‑point‑to‑single‑mask to single‑point‑to‑multi‑instance. A prompt‑conditioned instance‑query generator transforms a single point prompt into multiple specialized queries, enabling retrieval of all semantically similar lesions across the 3D volume from a single exemplar. A hybrid CNN‑Transformer encoder injects CNN‑derived boundary saliency into ViT self‑attention via spatial gating. A competitively optimized query decoder then enables end‑to‑end, parallel, multi‑instance prediction through inter‑query competition. On the LiTS17 and KiTS21 datasets, MIQ‑SAM3D achieves comparable performance and exhibits strong robustness to prompts, providing a practical solution for efficient annotation of clinically relevant multi‑lesion cases.

Abstract:
Adverse weather conditions, such as rain, snow, and fog, severely degrade LiDAR semantic segmentation by introducing refraction, scattering, and point dropouts that compromise geometric integrity. While prior approaches ranging from weather simulation and mixing‑based augmentation to domain randomization and regularization enhance robustness, they frequently overlook structural vulnerabilities inherent to object boundaries, corners, and highly sparse regions. To address this limitation, we propose a Light Geometry‑Aware Adapter. This module aligns azimuths and applies horizontal circular padding to preserve neighbor continuity across the 0 deg‑360 deg wrap‑around boundary. Using a local‑window K‑Nearest Neighbors (KNN) search, it aggregates nearby points and computes lightweight local statistics, compressing them into compact geometry‑aware cues. During training, these cues facilitate region‑aware regularization, which effectively stabilizes predictions in structurally fragile areas. The proposed adapter is designed to be plug‑and‑play, complements existing augmentation techniques, and operates exclusively during training, incurring negligible inference overhead. Operating under a rigorous source‑only cross‑weather paradigm wherein models are trained on SemanticKITTI and evaluated on SemanticSTF without target‑domain labels or fine‑tuning, our adapter achieves a +3.4 mIoU improvement over strong data‑centric augmentation baselines. Furthermore, it demonstrates performance comparable to advanced class‑centric regularization methods. These findings highlight that geometry‑driven regularization constitutes a critical pathway toward achieving highly robust, all‑weather LiDAR segmentation.
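
The horizontal circular padding step has a simple reading if the scan is laid out as a range image with beams as rows and azimuth bins as columns. That layout is an assumption here, as is the function name. Wrap-padding the azimuth axis makes the first and last columns neighbours, so a local window near 0° also sees points near 360°.

```python
import numpy as np

def circular_pad_azimuth(range_image, pad):
    """Wrap-pad the azimuth (last) axis so columns near 0° and 360° become neighbours."""
    pad_width = [(0, 0)] * (range_image.ndim - 1) + [(pad, pad)]
    return np.pad(range_image, pad_width, mode="wrap")

# Toy range image: 2 beams x 8 azimuth bins.
image = np.arange(16, dtype=float).reshape(2, 8)
padded = circular_pad_azimuth(image, pad=2)
```

A local-window neighbour search run on `padded` then needs no special case at the seam, and the extra columns can be cropped afterwards.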

Abstract:
3D instance segmentation is an important task for real‑world applications. To avoid costly manual annotations, existing methods have explored generating pseudo labels by transferring 2D masks from foundation models to 3D. However, this approach is often suboptimal since the video frames are processed independently. This causes inconsistent segmentation granularity and conflicting 3D pseudo labels, which degrades the accuracy of final segmentation. To address this, we introduce a Granularity‑Consistent automatic 2D Mask Tracking approach that maintains temporal correspondences across frames, eliminating conflicting pseudo labels. Combined with a three‑stage curriculum learning framework, our approach progressively trains from fragmented single‑view data to unified multi‑view annotations, and ultimately to globally coherent full‑scene supervision. This structured learning pipeline progressively exposes the model to pseudo‑labels of increasing consistency. Thus, we can robustly distill a consistent 3D representation from initially fragmented and contradictory 2D priors. Experimental results demonstrated that our method effectively generated consistent and accurate 3D segmentations. Furthermore, the proposed method achieved state‑of‑the‑art results on standard benchmarks and open‑vocabulary ability.

Abstract:
Understanding surgical instrument‑tissue interactions requires not only identifying which instrument performs which action on which anatomical target, but also grounding these interactions spatially within the surgical scene. Existing surgical action triplet recognition methods are limited to learning from frame‑level classification, failing to reliably link actions to specific instrument instances. Previous attempts at spatial grounding have primarily relied on class activation maps, which lack the precision and robustness required for detailed instrument‑tissue interaction analysis. To address this gap, we propose grounding surgical action triplets with instrument instance segmentation, or triplet segmentation for short, a new unified task which produces spatially grounded <instrument, verb, target> outputs. We start by presenting CholecTriplet‑Seg, a large‑scale dataset containing over 30,000 annotated frames, linking instrument instance masks with action verb and anatomical target annotations, and establishing the first benchmark for strongly supervised, instance‑level triplet grounding and evaluation. To learn triplet segmentation, we propose TargetFusionNet, a novel architecture that extends Mask2Former with a target‑aware fusion mechanism to address the challenge of accurate anatomical target prediction by fusing weak anatomy priors with instrument instance queries. Evaluated across recognition, detection, and triplet segmentation metrics, TargetFusionNet consistently improves performance over existing baselines, demonstrating that strong instance supervision combined with weak target priors significantly enhances the accuracy and robustness of surgical action understanding. Triplet segmentation establishes a unified framework for spatially grounding surgical action triplets. The proposed benchmark and architecture pave the way for more interpretable surgical scene understanding.

Abstract:
Although multimodal large language models (MLLMs) excel in high‑level vision‑language reasoning, they lack inherent awareness of visual saliency, making it difficult to identify key visual elements. To bridge this gap, we propose Saliency‑R1, the first unified MLLM framework that jointly tackles three representative and heterogeneous saliency tasks: Salient Object Detection (SOD), Salient Instance Segmentation (SIS), and Co‑salient Object Detection (CoSOD), enhancing the model's capacity for saliency reasoning. We introduce a textual interface with structured tags (<rg>, <ins>) to encode region‑ and instance‑level referring expressions, enabling a single referring segmenter to produce task‑appropriate masks. To train the MLLM efficiently, we propose Confidence‑Guided Policy Optimization (CGPO), a novel single‑sample reinforcement learning algorithm. CGPO improves on GRPO by replacing group‑normalized advantages with a per‑sample signal based on reward‑confidence discrepancy, thereby reducing computational waste, mitigating signal dilution, and lowering training overhead. Our model exceeds or matches the performance of robust open/closed‑source MLLMs and specialized state‑of‑the‑art methods across all three tasks, demonstrating the efficacy of our framework in saliency reasoning.
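
The contrast between group-normalized and per-sample signals can be made concrete with a small sketch. The first function is the usual GRPO-style standardisation of rewards within one sampled group. The second is purely hypothetical: the abstract only says CGPO uses a signal "based on reward‑confidence discrepancy", so a plain difference is assumed here for illustration and is not the authors' formula.

```python
import numpy as np

def group_normalized_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardise rewards within one group of samples."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def reward_confidence_signal(reward, confidence):
    """Hypothetical per-sample signal: gap between obtained reward and model confidence."""
    return reward - confidence

group = group_normalized_advantages([0.2, 0.8, 0.5, 0.5])
flat = group_normalized_advantages([0.7, 0.7, 0.7, 0.7])  # every sample scored alike
single = reward_confidence_signal(reward=1.0, confidence=0.25)
```

A group in which every sample receives the same reward yields all-zero advantages, so those rollouts carry no learning signal, whereas a per-sample signal needs only one rollout per prompt.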

Abstract:
Rapid climate change and other disturbances in alpine ecosystems demand frequent habitat monitoring, yet manual mapping remains prohibitively expensive for the required temporal resolution. We employ deep learning for change detection using long‑term alpine habitat data from Gesaeuse National Park, Austria, addressing a major gap in applying geospatial foundation models (GFMs) to complex natural environments with fuzzy class boundaries and highly imbalanced classes. We compare two paradigms: post‑classification change detection (CD) versus direct CD. For post‑classification CD, we evaluate GFMs Prithvi‑EO‑2.0 and Clay v1.0 against U‑Net CNNs; for direct CD, we test the transformer ChangeViT against U‑Net baselines. Using high‑resolution multimodal data (RGB, NIR, LiDAR, terrain attributes) covering 4,480 documented changes over 15.3 km2, results show Clay v1.0 achieves 51% overall accuracy versus U‑Net's 41% for multi‑class habitat change, while both reach 67% for binary change detection. Direct CD yields superior IoU (0.53 vs 0.35) for binary but only 28% accuracy for multi‑class detection. Cross‑temporal evaluation reveals GFM robustness, with Clay maintaining 33% accuracy on 2020 data versus U‑Net's 23%. Integrating LiDAR improves semantic segmentation from 30% to 50% accuracy. Although overall accuracies are lower than in more homogeneous landscapes, they reflect realistic performance for complex alpine habitats. Future work will integrate object‑based post‑processing and physical constraints to enhance applicability.

Abstract:
Semantic segmentation of blood vessels is an important task in medical image analysis, but its progress is often hindered by the scarcity of large annotated datasets and the poor generalization of models across different imaging modalities. A key aspect is the tendency of Convolutional Neural Networks (CNNs) to learn texture‑based features, which limits their performance when applied to new domains with different visual characteristics. We hypothesize that leveraging geometric priors of vessel shapes, such as their tubular and branching nature, can lead to more robust and data‑efficient models. To investigate this, we introduce VessShape, a methodology for generating large‑scale 2D synthetic datasets designed to instill a shape bias in segmentation models. VessShape images contain procedurally generated tubular geometries combined with a wide variety of foreground and background textures, encouraging models to learn shape cues rather than textures. We demonstrate that a model pre‑trained on VessShape images achieves strong few‑shot segmentation performance on two real‑world datasets from different domains, requiring only four to ten samples for fine‑tuning. Furthermore, the model exhibits notable zero‑shot capabilities, effectively segmenting vessels in unseen domains without any target‑specific training. Our results indicate that pre‑training with a strong shape bias can be an effective strategy to overcome data scarcity and improve model generalization in blood vessel segmentation.

Abstract:
Parameter‑Efficient Fine‑Tuning (PEFT) has emerged as a key strategy for adapting large‑scale pre‑trained models to downstream tasks, but existing approaches face notable limitations. Addition‑based methods, such as Adapters, introduce inference latency and engineering complexity, whereas selection‑based methods like Gradient‑based Parameter Selection (GPS) require a full backward pass. The reliance on gradients not only incurs massive memory usage and substantial computational latency, but also leaves the selection vulnerable to the randomness of stochastic batch sampling. To resolve this, we propose Growth‑Driven Feedforward Parameter Selection (GD‑FPS). Operating entirely via forward passes, this strictly gradient‑free method identifies the optimal parameter subset by scaling intrinsic weight magnitudes by their relative activation growth against a pre‑training anchor. Evaluated on 26 visual tasks spanning image classification and semantic segmentation, GD‑FPS achieves competitive or superior performance over state‑of‑the‑art PEFT baselines. Crucially, compared to GPS, it reduces peak memory usage by nearly 18× and accelerates execution by over 2.7× during the parameter selection stage. By guaranteeing deterministic selection, GD‑FPS offers a memory‑efficient, fast, and robust solution for fine‑tuning.

Abstract:
Accurate building instance segmentation and height classification are critical for urban planning, 3D city modeling, and infrastructure monitoring. This paper presents a detailed analysis of YOLOv11, the recent advancement in the YOLO series of deep learning models, focusing on its application to joint building extraction and discrete height classification from satellite imagery. YOLOv11 builds on the strengths of earlier YOLO models by introducing a more efficient architecture that better combines features at different scales, improves object localization accuracy, and enhances performance in complex urban scenes. Using the DFC2023 Track 2 dataset, which includes over 125,000 annotated buildings across 12 cities, we evaluate YOLOv11's performance using metrics such as precision, recall, F1 score, and mean average precision (mAP). Our findings demonstrate that YOLOv11 achieves strong instance segmentation performance with 60.4% mAP@50 and 38.3% mAP@50‑95 while maintaining robust classification accuracy across five predefined height tiers. The model excels in handling occlusions, complex building shapes, and class imbalance, particularly for rare high‑rise structures. Comparative analysis confirms that YOLOv11 outperforms earlier multitask frameworks in both detection accuracy and inference speed, making it well‑suited for real‑time, large‑scale urban mapping. This research highlights YOLOv11's potential to advance semantic urban reconstruction through streamlined categorical height modeling, offering actionable insights for future developments in remote sensing and geospatial intelligence.

Abstract:
This paper presents the Autonomous Driving Segment Anything Model (AD‑SAM), a fine‑tuned vision foundation model for semantic segmentation in autonomous driving (AD). AD‑SAM extends the Segment Anything Model (SAM) with a dual encoder and a deformable decoder tailored to the spatial and geometric complexity of road scenes. The dual encoder produces multi‑scale fused representations by combining global semantic context from SAM's pretrained Vision Transformer (ViT‑H) with local spatial detail from a trainable convolutional deep learning backbone (i.e., ResNet‑50). A deformable fusion module aligns heterogeneous features across scales and object geometries. The decoder performs progressive multi‑stage refinement using deformable attention. Training is guided by a hybrid loss that integrates Focal, Dice, Lovasz‑Softmax, and Surface losses, improving semantic class balance, boundary precision, and optimization stability. Experiments on the Cityscapes and Berkeley DeepDrive 100K (BDD100K) benchmarks show that AD‑SAM surpasses SAM, Generalized SAM (G‑SAM), and a deep learning baseline (DeepLabV3) in segmentation accuracy. It achieves 68.1 mean Intersection over Union (mIoU) on Cityscapes and 59.5 mIoU on BDD100K, outperforming SAM, G‑SAM, and DeepLabV3 by margins of up to +22.9 and +19.2 mIoU in structured and diverse road scenes, respectively. AD‑SAM demonstrates strong cross‑domain generalization with a 0.87 retention score (vs. 0.76 for SAM), and faster, more stable learning dynamics, converging within 30‑40 epochs, twice as fast as the benchmark models. It maintains 0.607 mIoU with only 1000 samples, suggesting data efficiency critical for reducing annotation costs. These results confirm that targeted architectural and optimization enhancements to foundation models enable reliable and scalable AD perception.
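
To make the idea of a hybrid segmentation loss concrete, here is a minimal sketch that combines only two of the four terms named above, Focal and Dice, for a binary mask. The Lovasz‑Softmax and Surface terms are omitted, and the weights `w_focal` and `w_dice`, the `gamma` value, and the smoothing constant are assumptions rather than settings reported for AD‑SAM.

```python
import math


def focal_loss(probs, targets, gamma=2.0, eps=1e-7):
    """Mean binary focal loss over flat lists of probabilities and 0/1 targets."""
    total = 0.0
    for p, t in zip(probs, targets):
        # Probability assigned to the true class, clamped away from zero.
        pt = p if t == 1 else 1.0 - p
        pt = min(max(pt, eps), 1.0)
        # Easy pixels (pt close to 1) are down-weighted by (1 - pt) ** gamma.
        total += -((1.0 - pt) ** gamma) * math.log(pt)
    return total / len(probs)


def dice_loss(probs, targets, smooth=1.0):
    """Soft Dice loss: 1 - (2 * overlap + smooth) / (sum of masses + smooth)."""
    inter = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * inter + smooth) / (denom + smooth)


def hybrid_loss(probs, targets, w_focal=1.0, w_dice=1.0):
    """Weighted sum of the two terms (weights are illustrative assumptions)."""
    return w_focal * focal_loss(probs, targets) + w_dice * dice_loss(probs, targets)


targets = [1, 0, 1]
good = hybrid_loss([1.0, 0.0, 1.0], targets)  # perfect prediction
bad = hybrid_loss([0.1, 0.9, 0.1], targets)   # confidently wrong prediction
```

The Focal term addresses class imbalance at the pixel level, while the Dice term rewards region overlap, which is one common motivation for combining them.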

Abstract:
Age‑related macular degeneration (AMD) is one of the leading causes of irreversible vision impairment in people over the age of 60. This research focuses on semantic segmentation for AMD lesion detection in RGB fundus images, a non‑invasive and cost‑effective imaging technique. The results of the ADAM challenge, the most comprehensive research competition and open dataset for AMD detection from RGB fundus images to date, serve as a benchmark for our evaluation. Taking the U‑Net connectivity as the basis of our framework, we evaluate and compare several approaches to improve the segmentation model's architecture and training pipeline, including pre‑processing techniques, encoder (backbone) deep network types of varying complexity, and specialized loss functions to mitigate class imbalances on image and pixel levels. The main outcome of this research is the final configuration of the AMD detection framework, which outperforms all the prior ADAM challenge submissions on the multi‑class segmentation of different AMD lesion types in non‑invasive RGB fundus images. The source code used to conduct the experiments presented in this paper is made freely available.

Abstract:
Existing infrared and visible image fusion methods often face the dilemma of balancing modal information. Generative fusion methods reconstruct fused images by learning from data distributions, but their generative capabilities remain limited. Moreover, the lack of interpretability in modal information selection further affects the reliability and consistency of fusion results in complex scenarios. This manuscript revisits the essence of generative image fusion under the inspiration of human cognitive laws and proposes a novel infrared and visible image fusion method, termed HCLFuse. First, HCLFuse investigates the quantification theory of information mapping in unsupervised fusion networks, which leads to the design of a multi‑scale mask‑regulated variational bottleneck encoder. This encoder applies posterior probability modeling and information decomposition to extract accurate and concise low‑level modal information, thereby supporting the generation of high‑fidelity structural details. Furthermore, the probabilistic generative capability of the diffusion model is integrated with physical laws, forming a time‑varying physical guidance mechanism that adaptively regulates the generation process at different stages, thereby enhancing the ability of the model to perceive the intrinsic structure of data and reducing dependence on data quality. Experimental results show that the proposed method achieves state‑of‑the‑art fusion performance in qualitative and quantitative evaluations across multiple datasets and significantly improves semantic segmentation metrics. This fully demonstrates the advantages of this generative image fusion method, drawing inspiration from human cognition, in enhancing structural consistency and detail quality.

Abstract:
Elbow and wrist fractures are the most common fractures in pediatric populations. Automatic segmentation of musculoskeletal structures in ultrasound (US) can improve diagnostic accuracy and treatment planning. Fractures appear as cortical defects but require expert interpretation. Deep learning (DL) can provide real‑time feedback and highlight key structures, helping lightly trained users perform exams more confidently. However, pixel‑wise expert annotations for training remain time‑consuming and costly. To address this challenge, we propose FlexICL, a novel and flexible in‑context learning (ICL) framework for segmenting bony regions in US images. We apply it to an intra‑video segmentation setting, where experts annotate only a small subset of frames, and the model segments unseen frames. We systematically investigate various image concatenation techniques and training strategies for visual ICL and introduce novel concatenation methods that significantly enhance model performance with limited labeled data. By integrating multiple augmentation strategies, FlexICL achieves robust segmentation performance across four wrist and elbow US datasets while requiring only 5% of the training images. It outperforms state‑of‑the‑art visual ICL models like Painter, MAE‑VQGAN, and conventional segmentation models like U‑Net and TransUNet by 1‑27% Dice coefficient on 1,252 US sweeps. These initial results highlight the potential of FlexICL as an efficient and scalable solution for US image segmentation well suited for medical imaging use cases where labeled data is scarce.
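
For readers unfamiliar with the Dice coefficient used as the metric above, a minimal sketch of its standard definition for binary masks follows. The function name and the convention of returning 1.0 for two empty masks are assumptions made for this example.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |A and B| / (|A| + |B|) for flat binary masks (lists of 0/1)."""
    inter = sum(p & t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Two empty masks agree perfectly; this convention is an assumption.
    return 1.0 if total == 0 else 2.0 * inter / total


pred = [1, 1, 0, 0]
target = [1, 0, 0, 0]
score = dice_coefficient(pred, target)  # one shared pixel, three foreground pixels in total
```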

Abstract:
The vulnerability of cyclists, exacerbated by the rising popularity of faster e‑bikes, motivates adapting automotive perception technologies for bicycle safety. We use our multi‑sensor 'SenseBike' research platform to develop and evaluate a 3D LiDAR segmentation approach tailored to bicycles. To bridge the automotive‑to‑bicycle domain gap, we introduce the novel BikeScenes‑lidarseg Dataset, comprising 3021 consecutive LiDAR scans around the university campus of the TU Delft, semantically annotated for 29 dynamic and static classes. By evaluating model performance, we demonstrate that fine‑tuning on our BikeScenes dataset achieves a mean Intersection‑over‑Union (mIoU) of 63.6%, significantly outperforming the 13.8% obtained with SemanticKITTI pre‑training alone. This result underscores the necessity and effectiveness of domain‑specific training. We highlight key challenges specific to bicycle‑mounted, hardware‑constrained perception systems and contribute the BikeScenes dataset as a resource for advancing research in cyclist‑centric LiDAR segmentation.
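
The mIoU figures quoted above follow the usual definition: per-class intersection over union, averaged over classes. A minimal sketch is given below; skipping classes that appear in neither prediction nor ground truth is a common convention assumed here, not something stated in the abstract.

```python
def mean_iou(pred, target, num_classes):
    """Mean Intersection-over-Union over flat lists of per-point class labels."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # ignore classes absent from both prediction and ground truth
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


pred = [0, 0, 1, 1]
target = [0, 1, 1, 1]
miou = mean_iou(pred, target, num_classes=3)  # class 2 never occurs and is skipped
```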

Abstract:
We propose LangHOPS, the first Multimodal Large Language Model (MLLM) based framework for open‑vocabulary object‑part instance segmentation. Given an image, LangHOPS can jointly detect and segment hierarchical object and part instances from open‑vocabulary candidate categories. Unlike prior approaches that rely on heuristic or learnable visual grouping, our approach grounds object‑part hierarchies in language space. It integrates the MLLM into the object‑part parsing pipeline to leverage its rich knowledge and reasoning capabilities, and to link multi‑granularity concepts within the hierarchies. We evaluate LangHOPS across multiple challenging scenarios, including in‑domain and cross‑dataset object‑part instance segmentation, and zero‑shot semantic segmentation. LangHOPS achieves state‑of‑the‑art results, surpassing previous methods by 5.5% Average Precision (AP) (in‑domain) and 4.8% (cross‑dataset) on the PartImageNet dataset and by 2.5% mIoU on unseen object parts in ADE20K (zero‑shot). Ablation studies further validate the effectiveness of the language‑grounded hierarchy and the MLLM‑driven part query refinement strategy. The code will be released.

Abstract:
Prevalent semantic segmentation methods generally adopt a vanilla classifier to categorize each pixel into specific classes. Although such a classifier learns global information from the training data, this information is represented by a set of fixed parameters (weights and biases). However, each image has a different class distribution, which prevents the classifier from addressing the unique characteristics of individual images. At the dataset level, class imbalance leads to segmentation results being biased towards majority classes, limiting the model's effectiveness in identifying and segmenting minority class regions. In this paper, we propose an Extended Context‑Aware Classifier (ECAC) that dynamically adjusts the classifier using global (dataset‑level) and local (image‑level) contextual information. Specifically, we leverage a memory bank to learn dataset‑level contextual information of each class, incorporating the class‑specific contextual information from the current image to improve the classifier for precise pixel labeling. Additionally, a teacher‑student network paradigm is adopted, where the domain expert (teacher network) dynamically adjusts contextual information with ground truth and transfers knowledge to the student network. Comprehensive experiments illustrate that the proposed ECAC can achieve state‑of‑the‑art performance across several datasets, including ADE20K, COCO‑Stuff10K, and Pascal‑Context.

Abstract:
Current 3D scene understanding methods are limited by offline‑collected multi‑view data or pre‑constructed 3D geometry. In this paper, we present ExtractAnything3D (EA3D), a unified online framework for open‑world 3D object extraction that enables simultaneous geometric reconstruction and holistic scene understanding. Given a streaming video, EA3D dynamically interprets each frame using vision‑language and 2D vision foundation encoders to extract object‑level knowledge. This knowledge is integrated and embedded into a Gaussian feature map via a feed‑forward online update strategy. We then iteratively estimate visual odometry from historical frames and incrementally update online Gaussian features with new observations. A recurrent joint optimization module directs the model's attention to regions of interest, simultaneously enhancing both geometric reconstruction and semantic understanding. Extensive experiments across diverse benchmarks and tasks, including photo‑realistic rendering, semantic and instance segmentation, 3D bounding box and semantic occupancy estimation, and 3D mesh generation, demonstrate the effectiveness of EA3D. Our method establishes a unified and efficient framework for joint online 3D reconstruction and holistic scene understanding, enabling a broad range of downstream tasks.

Abstract:
Class Activation Mapping (CAM) methods are widely applied in weakly supervised learning tasks due to their ability to highlight object regions. However, conventional CAM methods highlight only the most discriminative regions of the target. These highlighted regions often fail to cover the entire object and are frequently misaligned with object boundaries, thereby limiting the performance of downstream weakly supervised learning tasks, particularly Weakly Supervised Semantic Segmentation (WSSS), which demands pixel‑wise accurate activation maps to get the best results. To alleviate the above problems, we propose a novel activation method, Region‑CAM. Distinct from network feature weighting approaches, Region‑CAM generates activation maps by extracting semantic information maps (SIMs) and performing semantic information propagation (SIP), considering both gradients and features in each stage of the baseline classification model. Our approach highlights a greater proportion of object regions while ensuring that activation maps have precise boundaries that align closely with object edges. Region‑CAM achieves 60.12% and 58.43% mean intersection over union (mIoU) using the baseline model on the PASCAL VOC training and validation datasets, respectively, which are improvements of 13.61 and 13.13 percentage points over the original CAM (46.51% and 45.30%). On the MS COCO validation set, Region‑CAM achieves 36.38%, a 16.23 percentage point improvement over the original CAM (20.15%). We also demonstrate the superiority of Region‑CAM in object localization tasks, using the ILSVRC2012 validation set. Region‑CAM achieves 51.7% in Top‑1 Localization accuracy (Loc1). Compared with LayerCAM, an activation method designed for weakly supervised object localization, Region‑CAM achieves 4.5% better performance in Loc1.

Authors: Inclusion AI, Bowen Ma, Cheng Zou, ChengKun Du, Canxiang Yan, Chunxiang Jin, Chunjie Shen, Chenyu Lian, Chengxiang Fan, Dandan Zheng, Fudong Wang, Furong Xu, Guangming Yao, Haohao Liu, Han Peng, Jun Zhou, Junluan Xia, Jingdong Chen, Jianing Li, Jianxin Sun, Jianjiang Zhu, Jianping Jiang, Jinpeng Ou, Jun Peng, Jin Peng, Kaixiang Ji, Li Tang, Libin Wang, Lixiang Ru, Longhua Tan, Lu Ma, Lan Wang, Mochen Bai, Minghong Cai, Mingxue Yang, Ning Gao, Qingpei Guo, Qinglong Zhang, Qiang Xu, Qin Zhao, Rui Liu, Ruijie Xiong, Ruobing Zheng, Sirui Gao, Shaoxiong Lin, Tao Zhang, Tianqi Li, Tinghao Liu, Tongli Wang, Taoye Huang, Weilong Chai, Xiaomei Wang, Xiaolong Wang, Xiaojian Liu, Xiao Lu, Xiaoyu Li, Xingning Dong, Xuzheng Yu, Xuezhi Wang, Yi Yuan, Yuting Gao, Yuting Xiao, Yunxiao Sun, Yipeng Chen, Yifan Mao, Yifei Wu, Yongjie Lyu, Yingying Zhang, YuQian Li, Ziping Ma, Zhiqiang Fang, Zhihao Qiu, Ziyuan Huang, Zizheng Yang, Zhengyu He

Abstract:
We propose Ming‑Flash‑Omni, an upgraded version of Ming‑Omni, built upon a sparser Mixture‑of‑Experts (MoE) variant of Ling‑Flash‑2.0 with 100 billion total parameters, of which only 6.1 billion are active per token. This architecture enables highly efficient scaling (dramatically improving computational efficiency while significantly expanding model capacity) and empowers stronger unified multimodal intelligence across vision, speech, and language, representing a key step toward Artificial General Intelligence (AGI). Compared to its predecessor, the upgraded version exhibits substantial improvements across multimodal understanding and generation. Notably, it achieves strong performance on vision‑language understanding benchmarks, with overall scores on par with Gemini 2.5 Pro, and enables seamless switching among multimodal tasks in multi‑turn interactions. In speech, it achieves strong performance in contextual and dialect‑aware ASR while enabling joint, continuous‑generation of speech, sound, and music. In vision, it introduces generative semantic segmentation that achieves competitive standalone performance and enhances spatial control and editing consistency, alongside marked improvements in identity preservation, and high‑fidelity in‑image text rendering. Together, these capabilities demonstrate that a single unified model can serve as a practical foundation for general‑purpose multimodal intelligence.

Abstract:
The proposed solution is a deep learning technique that classifies three types of tea leaf disease, namely Red Rust, Helopeltis, and Red Spider Mite, and shows the leaf area damaged by each disease. Two of these diseases are caused by pests and one by pathogens (infectious organisms) and environmental conditions. In this paper, we evaluate two models for object detection: SSD MobileNet V2 and Faster R‑CNN ResNet50 V1. SSD MobileNet V2 achieved a precision of 0.209 and a recall of 0.02 over the IoU range 0.50:0.95, with a final mAP of 20.9%. Faster R‑CNN ResNet50 V1 achieved a precision of 0.252 and a recall of 0.044 over the same IoU range, with an mAP of 25%, which is better than SSD. We also used Mask R‑CNN for object instance segmentation, for which we implemented a custom method to calculate the damaged, diseased portion of leaves. Keywords: Tea Leaf Disease, Deep Learning, Red Rust, Helopeltis and Red Spider Mite, SSD MobileNet V2, Faster R‑CNN ResNet50 V1 and Mask RCNN.

Abstract:
Ensuring transparency and trust in artificial intelligence (AI) models is essential as they are increasingly deployed in safety‑critical and high‑stakes domains. Explainable AI (XAI) has emerged as a promising approach to address this challenge; however, the rigorous evaluation of XAI methods remains vital for balancing the trade‑offs between model complexity, predictive performance, and interpretability. While substantial progress has been made in evaluating XAI for classification tasks, strategies tailored to semantic segmentation remain limited. Moreover, objectively assessing XAI approaches is difficult, since qualitative visual explanations provide only preliminary insights. Such qualitative methods are inherently subjective and cannot ensure the accuracy or stability of explanations. To address these limitations, this work introduces a comprehensive quantitative evaluation framework for assessing XAI in semantic segmentation, accounting for both spatial and contextual task complexities. The framework systematically integrates pixel‑level evaluation strategies with carefully designed metrics to yield fine‑grained interpretability insights. Simulation results using recently adapted class activation mapping (CAM)‑based XAI schemes demonstrate the efficiency, robustness, and reliability of the proposed methodology. These findings advance the development of transparent, trustworthy, and accountable semantic segmentation models.

Abstract:
Recent studies reveal the vulnerability of the image segmentation foundation model SAM to adversarial examples. Its successor, SAM2, has attracted significant attention due to its strong generalization capability in video segmentation. However, its robustness remains unexplored, and it is unclear whether existing attacks on SAM can be directly transferred to SAM2. In this paper, we first analyze the performance gap of existing attacks between SAM and SAM2 and highlight two key challenges arising from their architectural differences: directional guidance from the prompt and semantic entanglement across consecutive frames. To address these issues, we propose UAP‑SAM2, the first cross‑prompt universal adversarial attack against SAM2 driven by dual semantic deviation. For cross‑prompt transferability, we begin by designing a target‑scanning strategy that divides each frame into k regions, each randomly assigned a prompt, to reduce prompt dependency during optimization. For effectiveness, we design a dual semantic deviation framework that optimizes a universal adversarial perturbation (UAP) by distorting the semantics within the current frame and disrupting the semantic consistency across consecutive frames. Extensive experiments on six datasets across two segmentation tasks demonstrate the effectiveness of the proposed method for SAM2. The comparative results show that UAP‑SAM2 outperforms state‑of‑the‑art (SOTA) attacks by a large margin.

Abstract:
Extending CLIP models to semantic segmentation remains challenging due to the misalignment between their image‑level pre‑training objectives and the pixel‑level visual understanding required for dense prediction. While prior efforts have achieved encouraging results by reorganizing the final layer and features, they often inherit the global alignment bias of preceding layers, leading to suboptimal segmentation performance. In this work, we propose LHT‑CLIP, a novel training‑free framework that systematically exploits the visual discriminability of CLIP across layer, head, and token levels. Through comprehensive analysis, we reveal three key insights: (i) the final layers primarily strengthen image‑text alignment at the cost of visual discriminability (e.g., last 3 layers in ViT‑B/16 and 8 layers in ViT‑L/14), partly due to the emergence of anomalous tokens; (ii) a subset of attention heads (e.g., 10 out of 144 in ViT‑B/16) display consistently strong visual discriminability across datasets; (iii) abnormal tokens display sparse and consistent activation patterns compared to normal tokens. Based on these findings, we propose three complementary techniques: semantic‑spatial reweighting, selective head enhancement, and abnormal token replacement to effectively restore visual discriminability and improve segmentation performance without any additional training, auxiliary pre‑trained networks, or extensive hyperparameter tuning. Extensive experiments on 8 common semantic segmentation benchmarks demonstrate that LHT‑CLIP achieves state‑of‑the‑art performance across diverse scenarios, highlighting its effectiveness and practicality for real‑world deployment.

Abstract:
Annotating real‑world LiDAR point clouds for use in intelligent autonomous systems is costly. To overcome this limitation, self‑training‑based Unsupervised Domain Adaptation (UDA) has been widely used to improve point cloud semantic segmentation by leveraging synthetic point cloud data. However, we argue that existing methods do not effectively utilize unlabeled data, as they rely on predefined or fixed confidence thresholds, resulting in suboptimal performance. In this paper, we propose a Dynamic Pseudo‑Label Filtering (DPLF) scheme to enhance real data utilization in point cloud UDA semantic segmentation. Additionally, we design a simple and efficient Prior‑Guided Data Augmentation Pipeline (PG‑DAP) to mitigate domain shift between synthetic and real‑world point clouds. Finally, we utilize a data‑mixing consistency loss to push the model to learn context‑free representations. We implement and thoroughly evaluate our approach through extensive comparisons with state‑of‑the‑art methods. Experiments on two challenging synthetic‑to‑real point cloud semantic segmentation tasks demonstrate that our approach achieves superior performance. Ablation studies confirm the effectiveness of the DPLF and PG‑DAP modules. We release the code of our method in this paper.

Abstract:
Remembering where object segments were predicted in the past is useful for improving the accuracy and consistency of class‑agnostic video segmentation algorithms. Existing video segmentation algorithms typically use either no object‑level memory (e.g. FastSAM) or they use implicit memories in the form of recurrent neural network features (e.g. SAM2). In this paper, we augment both types of segmentation models using an explicit 3D memory and show that the resulting models have more accurate and consistent predictions. For this, we develop an online 3D Gaussian Splatting (3DGS) technique to store predicted object‑level segments generated throughout the duration of a video. Based on this 3DGS representation, a set of fusion techniques are developed, named FastSAM‑Splat and SAM2‑Splat, that use the explicit 3DGS memory to improve their respective foundation models' predictions. Ablation experiments are used to validate the proposed techniques' design and hyperparameter settings. Results from both real‑world and simulated benchmarking experiments show that models which use explicit 3D memories result in more accurate and consistent predictions than those which use no memory or only implicit neural network memories. Project Page: https://topipari.com/projects/FastSAM‑Splat/

Abstract:
Spiking Neural Networks (SNNs) are gaining attention as energy‑efficient alternatives to Artificial Neural Networks (ANNs), especially in resource‑constrained settings. While ANN‑to‑SNN conversion (ANN2SNN) achieves high accuracy without end‑to‑end SNN training, existing methods rely on large time steps, leading to high inference latency and computational cost. In this paper, we propose a theoretical and practical framework for single‑timestep ANN2SNN. We establish the Temporal‑to‑Spatial Equivalence Theory, proving that multi‑timestep integrate‑and‑fire (IF) neurons can be equivalently replaced by single‑timestep multi‑threshold neurons (MTN). Based on this theory, we introduce the Scale‑and‑Fire Neuron (SFN), which enables effective single‑timestep (T=1) spiking through adaptive scaling and firing. Furthermore, we develop the SFN‑based Spiking Transformer (SFormer), a specialized instantiation of SFN within Transformer architectures, where spike patterns are aligned with attention distributions to mitigate the computational, energy, and hardware overhead of the multi‑threshold design. Extensive experiments on image classification, object detection, and instance segmentation demonstrate that our method achieves state‑of‑the‑art performance under single‑timestep inference. Notably, we achieve 88.8% top‑1 accuracy on ImageNet‑1K at T=1, surpassing existing conversion methods.
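
One simple way to build intuition for replacing multi-timestep IF neurons with a single-timestep multi-threshold neuron is the constant-input case sketched below. This is not the paper's Temporal‑to‑Spatial Equivalence proof: the soft-reset IF model, the constant input, and the evenly spaced thresholds `k * theta` are all assumptions chosen for illustration, and the function names are hypothetical.

```python
def if_spike_count(x, theta, T):
    """Integrate-and-fire neuron with soft reset, driven by constant input x for T steps."""
    v, spikes = 0.0, 0
    for _ in range(T):
        v += x
        if v >= theta:
            spikes += 1
            v -= theta  # soft reset: subtract the threshold, keep the residual
    return spikes


def mtn_spike_level(x, theta, T):
    """Single-step multi-threshold neuron: count thresholds k * theta (k = 1..T)
    that the accumulated input T * x reaches."""
    total = T * x
    return sum(1 for k in range(1, T + 1) if total >= k * theta)


# Dyadic inputs keep the floating-point accumulation exact in this demo.
inputs = [-0.5, 0.0, 0.25, 0.5, 0.75, 1.0, 1.5]
pairs = [(if_spike_count(x, 1.0, 4), mtn_spike_level(x, 1.0, 4)) for x in inputs]
```

Under these assumptions the two neurons emit the same output level for every input in the list, including the saturated case where the input exceeds the threshold at every step.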

Abstract:
We present Seq‑DeepIPC, a sequential end‑to‑end perception‑to‑control model for legged robot navigation in real‑world environments. Seq‑DeepIPC advances intelligent sensing for autonomous legged navigation by tightly integrating multi‑modal perception (RGB‑D + GNSS) with temporal fusion and control. The model jointly predicts semantic segmentation and depth estimation, giving richer spatial features for planning and control. For efficient deployment on edge devices, we use EfficientNet‑B0 as the encoder, reducing computation while maintaining accuracy. Heading estimation is simplified by removing the noisy IMU and instead computing the bearing angle directly from consecutive GNSS positions. We collected a larger and more diverse dataset that includes both road and grass terrains, and validated Seq‑DeepIPC on a robot dog. Comparative and ablation studies show that sequential inputs improve perception and control in our models, while other baselines do not benefit. Seq‑DeepIPC achieves competitive or better results with reasonable model size; although GNSS‑only heading is less reliable near tall buildings, it is robust in open areas. Overall, Seq‑DeepIPC extends end‑to‑end navigation beyond wheeled robots to more versatile and temporally‑aware systems. To support future research, we will release the code in our GitHub repository at https://github.com/oskarnatan/Seq‑DeepIPC.
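
Computing a heading from two consecutive GNSS fixes can be sketched with the standard initial-bearing formula on a sphere. Whether Seq‑DeepIPC uses exactly this formula is not stated in the abstract, so treat the function below, including its name and the 0 to 360 degree convention (0 = north, 90 = east), as an assumption.

```python
import math


def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial bearing in degrees from fix 1 to fix 2 (inputs in decimal degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    y = math.sin(dlon) * math.cos(phi2)
    x = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlon))
    # atan2 returns a value in (-180, 180]; shift it into [0, 360).
    return (math.degrees(math.atan2(y, x)) + 360.0) % 360.0


north = bearing_deg(0.0, 0.0, 1.0, 0.0)
east = bearing_deg(0.0, 0.0, 0.0, 1.0)
west = bearing_deg(0.0, 0.0, 0.0, -1.0)
```

Because the bearing depends on the difference between two noisy positions, it degrades when fixes are close together or multipath is strong, which is consistent with the abstract's remark about tall buildings.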

Abstract:
Self‑supervised learning (SSL) has recently emerged as a key strategy for building foundation models in remote sensing, where the scarcity of annotated data limits the applicability of fully supervised approaches. In this work, we introduce WaveMAE, a masked autoencoding framework tailored for multispectral satellite imagery. Unlike conventional pixel‑based reconstruction, WaveMAE leverages a multi‑level Discrete Wavelet Transform (DWT) to disentangle frequency components and guide the encoder toward learning scale‑aware high‑frequency representations. We further propose a Geo‑conditioned Positional Encoding (GPE), which incorporates geographical priors via Spherical Harmonics, encouraging embeddings that respect both semantic and geospatial structure. To ensure fairness in evaluation, all methods are pretrained on the same dataset (fMoW‑S2) and systematically evaluated on the diverse downstream tasks of the PANGAEA benchmark, spanning semantic segmentation, regression, change detection, and multilabel classification. Extensive experiments demonstrate that WaveMAE achieves consistent improvements over prior state‑of‑the‑art approaches, with substantial gains on segmentation and regression benchmarks. The effectiveness of WaveMAE pretraining is further demonstrated by showing that even a lightweight variant, containing only 26.4% of the parameters, achieves state‑of‑the‑art performance. Our results establish WaveMAE as a strong and geographically informed foundation model for multispectral remote sensing imagery.
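
The multi-level Discrete Wavelet Transform mentioned above splits a signal into a coarse approximation plus detail (high-frequency) coefficients at each level. The sketch below shows this for a 1D signal with the Haar wavelet; WaveMAE operates on 2D multispectral imagery and its choice of wavelet is not given in the abstract, so both the 1D setting and the Haar basis are simplifying assumptions.

```python
import math


def haar_step(signal):
    """One Haar analysis step on an even-length list: (approximation, detail)."""
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_dwt(signal, levels):
    """Multi-level Haar DWT; len(signal) must be divisible by 2 ** levels.

    Returns the final approximation and a list of detail bands, finest first.
    """
    details = []
    approx = list(signal)
    for _ in range(levels):
        approx, d = haar_step(approx)
        details.append(d)
    return approx, details


approx, details = haar_dwt([4.0, 4.0, 2.0, 2.0], levels=2)
```

For this piecewise-constant input the finest detail band is zero, and the only non-zero detail coefficient appears at the coarser level where the step between 4 and 2 becomes visible.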

Abstract:
The growing memory footprint of the Key‑Value (KV) cache poses a severe scalability bottleneck for long‑context Large Language Model (LLM) inference. While KV cache eviction has emerged as an effective solution by discarding less critical tokens, existing token‑, block‑, and sentence‑level compression methods struggle to balance semantic coherence and memory efficiency. To this end, we introduce SABlock, a semantic‑aware KV cache eviction framework with adaptive block sizes. Specifically, SABlock first performs semantic segmentation to align compression boundaries with linguistic structures, then applies segment‑guided token scoring to refine token importance estimation. Finally, for each segment, a budget‑driven search strategy adaptively determines the optimal block size that preserves semantic integrity while improving compression efficiency under a given cache budget. Extensive experiments on long‑context benchmarks demonstrate that SABlock consistently outperforms state‑of‑the‑art baselines under the same memory budgets. For instance, on Needle‑in‑a‑Haystack (NIAH), SABlock achieves 99.9% retrieval accuracy with only 96 KV entries, nearly matching the performance of the full‑cache baseline that retains up to 8K entries. Under a fixed cache budget of 1,024, SABlock further reduces peak memory usage by 46.28% and achieves up to 9.5x faster decoding on a 128K context length.

Abstract:
Pretrained models are ubiquitous in the current deep learning landscape, offering strong results on a broad range of tasks. Recent works have shown that models differing in various design choices exhibit categorically diverse generalization behavior, resulting in one model grasping distinct data‑specific insights unavailable to the other. In this paper, we propose to leverage large publicly available model repositories as an auxiliary source of model improvements. We introduce a data partitioning strategy where pretrained models autonomously adopt either the role of a student, seeking knowledge, or that of a teacher, imparting knowledge. Experiments across various tasks demonstrate the effectiveness of our proposed approach. In image classification, we improved the performance of ViT‑B by approximately 1.4% through bidirectional knowledge transfer with ViT‑T. For semantic segmentation, our method boosted all evaluation metrics by enabling knowledge transfer both within and across backbone architectures. In video saliency prediction, our approach achieved a new state‑of‑the‑art. We further extend our approach to knowledge transfer between multiple models, leading to considerable performance improvements for all model participants.

Abstract:
While recent semantic segmentation networks heavily rely on powerful pretrained encoders, most employ simplistic decoders, leading to suboptimal trade‑offs between semantic context and fine‑grained detail preservation. To address this, we propose a novel decoder architecture, WaveSeg, which jointly optimizes feature refinement in spatial and wavelet domains. Specifically, high‑frequency components are first learned from input images as explicit priors to reinforce boundary details at early stages. A multi‑scale fusion mechanism, Dual Domain Operation (DDO), is then applied, together with a novel Spectrum Decomposition Attention (SDA) block that leverages Mamba's linear‑complexity long‑range modeling to enhance high‑frequency structural details. Meanwhile, reparameterized convolutions are applied to preserve low‑frequency semantic integrity in the wavelet domain. Finally, a residual‑guided fusion integrates multi‑scale features with boundary‑aware representations at native resolution, producing semantically and structurally rich feature maps. Extensive experiments on standard benchmarks demonstrate that WaveSeg, leveraging a wavelet‑domain frequency prior with Mamba‑based attention, consistently outperforms state‑of‑the‑art approaches both quantitatively and qualitatively, achieving efficient and precise segmentation.

Abstract:
This paper investigates methods for estimating uncertainty in semantic segmentation predictions derived from satellite imagery. Estimating uncertainty for segmentation presents unique challenges compared to standard image classification, requiring scalable methods producing per‑pixel estimates. While most research on this topic has focused on scene understanding or medical imaging, this work benchmarks existing methods specifically for remote sensing and Earth observation applications. Our evaluation focuses on the practical utility of uncertainty measures, testing their ability to identify prediction errors and noise‑corrupted input image regions. Experiments are conducted on two remote sensing datasets, PASTIS and ForTy, selected for their differences in scale, geographic coverage, and label confidence. We perform an extensive evaluation featuring several models, such as Stochastic Segmentation Networks and ensembles, in combination with a number of neural architectures and uncertainty metrics. We make a number of practical recommendations based on our findings.
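
As a rough illustration of a per‑pixel uncertainty estimate from an ensemble, the sketch below computes the predictive entropy of the averaged class distribution. The choice of entropy as the metric, the function name `ensemble_entropy`, and the toy probabilities are assumptions made for illustration; the abstract does not state which uncertainty metrics the paper benchmarks.

```python
import math

def ensemble_entropy(member_probs):
    """Per-pixel predictive entropy of the averaged ensemble distribution.

    member_probs[m][p][c] is the softmax probability that ensemble member m
    assigns to class c at pixel p (pixels flattened into one list).
    Hypothetical helper, not taken from the paper.
    """
    n_members = len(member_probs)
    n_pixels = len(member_probs[0])
    entropies = []
    for p in range(n_pixels):
        n_classes = len(member_probs[0][p])
        # Average the class probabilities over ensemble members.
        mean = [
            sum(member_probs[m][p][c] for m in range(n_members)) / n_members
            for c in range(n_classes)
        ]
        # Shannon entropy in nats; zero-probability classes contribute nothing.
        entropies.append(-sum(q * math.log(q) for q in mean if q > 0.0))
    return entropies

# Toy example: two members, two pixels, two classes.
probs = [
    [[0.9, 0.1], [0.9, 0.1]],  # member 0
    [[0.9, 0.1], [0.1, 0.9]],  # member 1 disagrees on the second pixel
]
scores = ensemble_entropy(probs)
```

In this toy case the second pixel, where the members disagree, receives the higher entropy, which is the behavior one would want when using the score to flag likely prediction errors.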

Abstract:
Accurate segmentation and precise morphological analysis of neuronal cells in fluorescence microscopy images are crucial steps in neuroscience and biomedical imaging applications. However, this process is labor‑intensive and time‑consuming, requiring significant manual effort and expertise to ensure reliable outcomes. This work presents a pipeline for neuron instance segmentation and measurement based on a high‑resolution dataset of stem‑cell‑derived neurons. The proposed method uses YOLOv8, trained on manually annotated microscopy images. The model achieved high segmentation accuracy, exceeding 97%. In addition, the pipeline utilized both ground truth and predicted masks to extract biologically significant features, including cell length, width, area, and grayscale intensity values. The overall accuracy of the extracted morphological measurements reached 75.32%, further supporting the effectiveness of the proposed approach. This integrated framework offers a valuable tool for automated analysis in cell imaging and neuroscience research, reducing the need for manual annotation and enabling scalable, precise quantification of neuron morphology.
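
The feature‑extraction step can be sketched as follows, assuming a binary instance mask and a grayscale image of the same shape. The function name `mask_features`, the use of an axis‑aligned bounding box for length and width, and the pixel units are illustrative assumptions; the abstract does not describe how the pipeline measures these quantities.

```python
def mask_features(mask, image):
    """Area, bounding-box length/width, and mean intensity of one instance.

    mask  : 2D list of 0/1 values (one instance mask)
    image : 2D list of grayscale intensities, same shape as mask
    All sizes are in pixels. Hypothetical helper, not the paper's code.
    """
    pixels = [
        (r, c)
        for r, row in enumerate(mask)
        for c, v in enumerate(row)
        if v
    ]
    if not pixels:
        return None  # empty mask: nothing to measure
    rows = [r for r, _ in pixels]
    cols = [c for _, c in pixels]
    height = max(rows) - min(rows) + 1
    width = max(cols) - min(cols) + 1
    area = len(pixels)
    mean_intensity = sum(image[r][c] for r, c in pixels) / area
    return {
        "area": area,
        "length": max(height, width),  # longer side of the bounding box
        "width": min(height, width),   # shorter side of the bounding box
        "mean_intensity": mean_intensity,
    }

toy_mask = [
    [0, 1, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 0, 0],
]
toy_image = [
    [10, 20, 30, 40],
    [10, 20, 30, 40],
    [0, 0, 0, 0],
]
features = mask_features(toy_mask, toy_image)
```

Running the same function on ground‑truth and predicted masks gives paired measurements that can then be compared feature by feature.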

Abstract:
Semantic segmentation of electron microscopy (EM) images of biological samples remains a challenge in the life sciences. EM data captures details of biological structures, sometimes with such complexity that even human observers can find it overwhelming. We introduce ε‑Seg, a method based on hierarchical variational autoencoders (HVAEs), employing center‑region masking, sparse label contrastive learning (CL), a Gaussian mixture model (GMM) prior, and clustering‑free label prediction. Center‑region masking and the inpainting loss encourage the model to learn robust and representative embeddings to distinguish the desired classes, even if training labels are sparse (0.05% of the total image data or less). For optimal performance, we employ CL and a GMM prior to shape the latent space of the HVAE such that encoded input patches tend to cluster with respect to the semantic classes we wish to distinguish. Finally, instead of clustering latent embeddings for semantic segmentation, we propose an MLP semantic segmentation head to directly predict class labels from latent embeddings. We show empirical results of ε‑Seg and baseline methods on 2 dense EM datasets of biological tissues and demonstrate the applicability of our method also on fluorescence microscopy data. Our results show that ε‑Seg is capable of achieving competitive sparsely‑supervised segmentation results on complex biological image data, even if only limited amounts of training labels are available.

Abstract:
Reliable navigation in safety‑critical environments requires both accurate hazard perception and principled uncertainty handling to strengthen downstream safety handling. Despite the effectiveness of existing approaches, they assume perfect hazard detection capabilities, while uncertainty‑aware perception approaches lack finite‑sample guarantees. We present COPPOL, a conformal‑driven perception‑to‑policy learning approach that integrates distribution‑free, finite‑sample safety guarantees into semantic segmentation, yielding calibrated hazard maps with rigorous bounds for missed detections. These maps induce risk‑aware cost fields for downstream RL planning. Across two satellite‑derived benchmarks, COPPOL increases hazard coverage (up to 6x) compared to baselines, achieving near‑complete detection of unsafe regions while reducing hazardous violations during navigation (up to approximately 50%). More importantly, our approach remains robust to distributional shift, preserving both safety and efficiency.
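
To make the idea of a finite‑sample bound on missed detections concrete, here is a minimal sketch of generic split‑conformal calibration. It is not COPPOL's procedure, which the abstract does not specify. The function name `conformal_hazard_threshold`, the toy scores, and the assumption that a new hazard pixel's score is exchangeable with the calibration scores are all illustrative.

```python
import math

def conformal_hazard_threshold(calib_scores, alpha):
    """Threshold t such that flagging pixels with score >= t misses a new
    hazard pixel with probability at most alpha, assuming the new score is
    exchangeable with the calibration scores.

    calib_scores : predicted hazard probabilities at pixels whose true label
                   is "hazard" in a held-out calibration set
    alpha        : tolerated miss rate, e.g. 0.2
    """
    n = len(calib_scores)
    # Use the k-th smallest score with k = floor(alpha * (n + 1)); the chance
    # that a new exchangeable score falls strictly below it is at most k / (n + 1).
    k = math.floor(alpha * (n + 1))
    if k < 1:
        return float("-inf")  # too few samples for this alpha: flag everything
    return sorted(calib_scores)[k - 1]

scores = [0.2, 0.35, 0.5, 0.6, 0.7, 0.8, 0.85, 0.9, 0.95]
t = conformal_hazard_threshold(scores, alpha=0.2)
hazard_map = [s >= t for s in [0.1, 0.4, 0.9]]
```

The resulting binary map errs on the side of flagging more pixels as hazardous when calibration data are scarce, which is the conservative behavior a downstream planner would need.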

Abstract:
3D weakly supervised semantic segmentation (3D WSSS) aims to achieve semantic segmentation by leveraging sparse or low‑cost annotated data, significantly reducing reliance on dense point‑wise annotations. Previous works mainly employ class activation maps or pre‑trained vision‑language models to address this challenge. However, the low quality of pseudo‑labels and the insufficient exploitation of 3D geometric priors jointly create significant technical bottlenecks in developing high‑performance 3D WSSS models. In this paper, we propose a simple yet effective 3D weakly supervised semantic segmentation method that integrates 3D geometric priors into a class‑aware guidance mechanism to generate high‑fidelity pseudo labels. Concretely, our method first employs a Class‑Aware Label Refinement module to generate more balanced and accurate pseudo labels for semantic categories. This initial refinement stage focuses on enhancing label quality through category‑specific optimization. Subsequently, the Geometry‑Aware Label Refinement component integrates implicit 3D geometric constraints to filter out low‑confidence pseudo labels that fail to comply with geometric plausibility. Moreover, to address the challenge of extensive unlabeled regions, we propose a Label Update strategy that integrates Self‑Training to propagate labels into these areas. This iterative process continuously enhances pseudo‑label quality while expanding label coverage, ultimately fostering the development of high‑performance 3D WSSS models. Comprehensive experimental validation reveals that our proposed methodology achieves state‑of‑the‑art performance on both ScanNet and S3DIS benchmarks while demonstrating remarkable generalization capability in unsupervised settings, maintaining competitive accuracy through its robust design.

Abstract:
With the development of underwater exploration and marine protection, underwater vision tasks are widespread. Due to the degraded underwater environment, characterized by color distortion, low contrast, and blurring, camouflaged instance segmentation (CIS) faces greater challenges in accurately segmenting objects that blend closely with their surroundings. Traditional camouflaged instance segmentation methods, trained on terrestrial‑dominated datasets with limited underwater samples, may exhibit inadequate performance in underwater scenes. To address these issues, we introduce the first underwater camouflaged instance segmentation (UCIS) dataset, abbreviated as UCIS4K, which comprises 3,953 images of camouflaged marine organisms with instance‑level annotations. In addition, we propose an Underwater Camouflaged Instance Segmentation network based on Segment Anything Model (UCIS‑SAM). Our UCIS‑SAM includes three key modules. First, the Channel Balance Optimization Module (CBOM) enhances channel characteristics to improve underwater feature learning, effectively addressing the model's limited understanding of underwater environments. Second, the Frequency Domain True Integration Module (FDTIM) is proposed to emphasize intrinsic object features and reduce interference from camouflage patterns, enhancing the segmentation performance of camouflaged objects blending with their surroundings. Finally, the Multi‑scale Feature Frequency Aggregation Module (MFFAM) is designed to strengthen the boundaries of low‑contrast camouflaged instances across multiple frequency bands, improving the model's ability to achieve more precise segmentation of camouflaged objects. Extensive experiments on the proposed UCIS4K and public benchmarks show that our UCIS‑SAM outperforms state‑of‑the‑art approaches.

Abstract:
Deploying real‑time spatial perception on edge devices requires efficient multi‑task models that leverage complementary task information while minimizing computational overhead. This paper introduces Multi‑Mono‑Hydra (M2H), a novel multi‑task learning framework designed for semantic segmentation and depth, edge, and surface normal estimation from a single monocular image. Unlike conventional approaches that rely on independent single‑task models or shared encoder‑decoder architectures, M2H introduces a Window‑Based Cross‑Task Attention Module that enables structured feature exchange while preserving task‑specific details, improving prediction consistency across tasks. Built on a lightweight ViT‑based DINOv2 backbone, M2H is optimized for real‑time deployment and serves as the foundation for monocular spatial perception systems supporting 3D scene graph construction in dynamic environments. Comprehensive evaluations show that M2H outperforms state‑of‑the‑art multi‑task models on NYUDv2, surpasses single‑task depth and semantic baselines on Hypersim, and achieves superior performance on the Cityscapes dataset, all while maintaining computational efficiency on laptop hardware. Beyond benchmarks, M2H is validated on real‑world data, demonstrating its practicality in spatial perception tasks.

Abstract:
The escalating threat of weapon‑related violence necessitates automated detection systems capable of pixel‑level precision for accurate threat assessment in real‑time security applications. Traditional weapon detection approaches rely on object detection frameworks that provide only coarse bounding box localizations, lacking the fine‑grained segmentation required for comprehensive threat analysis. Furthermore, existing semantic segmentation models either sacrifice accuracy for computational efficiency or require excessive computational resources incompatible with edge deployment scenarios. This paper presents ArmFormer, a lightweight transformer‑based semantic segmentation framework that strategically integrates Convolutional Block Attention Module (CBAM) with MixVisionTransformer architecture to achieve superior accuracy while maintaining computational efficiency suitable for resource‑constrained edge devices. Our approach combines CBAM‑enhanced encoder backbone with attention‑integrated hamburger decoder to enable multi‑class weapon segmentation across five categories: handgun, rifle, knife, revolver, and human. Comprehensive experiments demonstrate that ArmFormer achieves state‑of‑the‑art performance with 80.64% mIoU and 89.13% mFscore while maintaining real‑time inference at 82.26 FPS. With only 4.886G FLOPs and 3.66M parameters, ArmFormer outperforms heavyweight models requiring up to 48x more computation, establishing it as the optimal solution for deployment on portable security cameras, surveillance drones, and embedded AI accelerators in distributed security infrastructure.

Abstract:
Modern automotive systems leverage deep neural networks (DNNs) for semantic segmentation and operate in two key application areas: (1) In‑car, where the DNN solely operates in the vehicle without strict constraints on the data rate. (2) Distributed, where one DNN part operates in the vehicle and the other part typically on a large‑scale cloud platform with a particular constraint on transmission bitrate efficiency. Typically, both applications share an image and source encoder, while each uses distinct (joint) source and task decoders. Prior work utilized convolutional neural networks for joint source and task decoding but did not investigate transformer‑based alternatives such as SegDeformer, which offer superior performance at the cost of higher computational complexity. In this work, we propose joint feature and task decoding for SegDeformer, thereby enabling lower computational complexity in both in‑car and distributed applications, despite SegDeformer's computational demands. This improves scalability in the cloud while reducing in‑car computational complexity. For the in‑car application, we increased the frames per second (fps) by up to a factor of 11.7 (1.4 fps to 16.5 fps) on Cityscapes and by up to a factor of 3.5 (43.3 fps to 154.3 fps) on ADE20K, while being on‑par w.r.t. the mean intersection over union (mIoU) of the transformer‑based baseline that doesn't compress by a source codec. For the distributed application, we achieve state‑of‑the‑art (SOTA) over a wide range of bitrates on the mIoU metric, while using only 0.14% (0.04%) of cloud DNN parameters used in previous SOTA, reported on ADE20K (Cityscapes).

Abstract:
Coral reefs are vital yet fragile ecosystems that require accurate large‑scale mapping for effective conservation. Although global products such as the Allen Coral Atlas provide unprecedented coverage of global coral reef distribution, their predictions are frequently limited in spatial precision and semantic consistency, especially in regions requiring fine‑grained boundary delineation. To address these challenges, we propose UKANFormer, a novel semantic segmentation model designed to achieve high‑precision mapping under noisy supervision derived from Allen Coral Atlas. Building upon the UKAN architecture, UKANFormer incorporates a Global‑Local Transformer (GL‑Trans) block in the decoder, enabling the extraction of both global semantic structures and local boundary details. In experiments, UKANFormer achieved a coral‑class IoU of 67.00% and pixel accuracy of 83.98%, outperforming conventional baselines under the same noisy labels setting. Remarkably, the model produces predictions that are visually and structurally more accurate than the noisy labels used for training. These results challenge the notion that data quality directly limits model performance, showing that architectural design can mitigate label noise and support scalable mapping under imperfect supervision. UKANFormer provides a foundation for ecological monitoring where reliable labels are scarce.

Abstract:
This paper presents a vision‑only autonomous flight system for small UAVs operating in controlled indoor environments. The system combines semantic segmentation with monocular depth estimation to enable obstacle avoidance, scene exploration, and autonomous safe landing operations without requiring GPS or expensive sensors such as LiDAR. A key innovation is an adaptive scale factor algorithm that converts non‑metric monocular depth predictions into accurate metric distance measurements by leveraging semantic ground plane detection and camera intrinsic parameters, achieving a mean distance error of 14.4 cm. The approach uses a knowledge distillation framework where a color‑based Support Vector Machine (SVM) teacher generates training data for a lightweight U‑Net student network (1.6M parameters) capable of real‑time semantic segmentation. For more complex environments, the SVM teacher can be replaced with a state‑of‑the‑art segmentation model. Testing was conducted in a controlled 5x4 meter laboratory environment with eight cardboard obstacles simulating urban structures. Extensive validation across 30 flight tests in a real‑world environment and 100 flight tests in a digital‑twin environment demonstrates that the combined segmentation and depth approach increases the distance traveled during surveillance and reduces mission time while maintaining 100% success rates. The system is further optimized through end‑to‑end learning, where a compact student neural network learns complete flight policies from demonstration data generated by our best‑performing method, achieving an 87.5% autonomous mission success rate. This work advances practical vision‑based drone navigation in structured environments, demonstrating solutions for metric depth estimation and computational efficiency challenges that enable deployment on resource‑constrained platforms.

Abstract:
Computer‑assisted surgery research requires large, deeply annotated video datasets that capture clinical and technical variability. Existing cataract surgery resources lack the diversity and annotation depth required to train generalizable deep‑learning models. To address this gap, we present a dataset of 3,000 phacoemulsification cataract surgery videos acquired at two surgical centers from surgeons with varying expertise. The dataset provides four annotation layers: temporal surgical phases, instance segmentation of instruments and anatomical structures, instrument‑tissue interaction tracking, and quantitative skill scores based on competency rubrics adapted from ICO‑OSCAR and GRASIS. We demonstrate the technical utility of the dataset through benchmarking deep learning models across four tasks: workflow recognition, scene segmentation, instrument‑tissue interaction tracking, and automated skill assessment. Furthermore, we establish a domain‑adaptation baseline for phase recognition and instance segmentation by training on one surgical center and evaluating on a held‑out center. Ultimately, these multi‑source acquisitions, multi‑layer annotations, and paired skill‑kinematic labels facilitate the development of generalizable multi‑task models for surgical workflow analysis, scene understanding, and competency‑based training research.

Abstract:
Open‑Vocabulary Semantic Segmentation (OVSS) assigns pixel‑level labels from an open set of categories, requiring generalization to unseen and unlabelled objects. Using vision‑language models (VLMs) to correlate local image patches with potential unseen object categories suffers from a lack of understanding of spatial relations of objects in a scene. To solve this problem, we introduce neuro‑symbolic (NeSy) spatial reasoning in OVSS. In contrast to contemporary VLM correlation‑based approaches, we propose Relational Segmentor (RelateSeg) to impose explicit spatial relational constraints by first order logic (FOL) formulated in a neural network architecture. This is the first attempt to explore NeSy spatial reasoning in OVSS. Specifically, RelateSeg automatically extracts spatial relations, e.g., <cat, to‑right‑of, person>, and encodes them as first‑order logic formulas using our proposed pseudo categories. Each pixel learns to predict both a semantic category (e.g., "cat") and a spatial pseudo category (e.g., "right of person") simultaneously, enforcing relational constraints (e.g., a "cat" pixel must lie to the right of a "person"). Finally, these logic constraints are formulated in a deep network architecture by fuzzy logic relaxation, enabling end‑to‑end learning of spatial‑relationally consistent segmentation. RelateSeg achieves state‑of‑the‑art performance in terms of average mIoU across four benchmark datasets and particularly shows clear advantages on images containing multiple categories, with the cost of only introducing a single auxiliary loss function and no additional parameters, validating the effectiveness of NeSy spatial reasoning in OVSS.
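
The fuzzy‑logic relaxation of a rule such as "a cat pixel must lie to the right of a person" can be sketched as a differentiable penalty. The abstract does not say which fuzzy operators RelateSeg uses; the Reichenbach implication below, along with the names `implication_loss` and `mean_constraint_loss`, is an assumption chosen for illustration.

```python
def implication_loss(p_cat, p_right_of_person):
    """Penalty for violating "cat -> right of person" at one pixel.

    Uses the Reichenbach (product-based) fuzzy implication
    I(a, b) = 1 - a + a * b, whose truth value lies in [0, 1] when
    a and b are probabilities. The loss is 1 - I(a, b) = a * (1 - b).
    """
    truth = 1.0 - p_cat + p_cat * p_right_of_person
    return 1.0 - truth

def mean_constraint_loss(cat_probs, relation_probs):
    """Average the per-pixel penalty over a flattened image."""
    losses = [implication_loss(a, b) for a, b in zip(cat_probs, relation_probs)]
    return sum(losses) / len(losses)

# Pixel 0 satisfies the rule, pixel 1 violates it, pixel 2 is not a cat.
aux_loss = mean_constraint_loss([1.0, 1.0, 0.0], [1.0, 0.0, 0.0])
```

Because the penalty is a smooth function of the predicted probabilities, it can be added to a standard segmentation loss as a single auxiliary term without new parameters, in line with the cost profile the abstract reports.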

Abstract:
Semantic segmentation is the task of classifying each pixel in an image. Training a segmentation model achieves best results using annotated images, where each pixel is annotated with the corresponding class. When obtaining fine annotations is difficult or expensive, it may be possible to acquire coarse annotations, e.g., by roughly annotating pixels in an image, leaving some pixels around the boundaries between classes unlabeled. Segmentation with coarse annotations is difficult, in particular when the objective is to optimize the alignment of boundaries between classes. This paper proposes a regularization method for models with an encoder‑decoder architecture with superpixel‑based upsampling. It encourages the segmented pixels in the decoded image to be SLIC‑superpixels, which are based on pixel color and position, independent of the segmentation annotation. The method is applied to the FCN‑16 fully convolutional network architecture and evaluated on the SUIM, Cityscapes, and PanNuke data sets. It is shown that the boundary recall improves significantly compared to state‑of‑the‑art models when trained on coarse annotations.

Abstract:
Street‑view imagery (SVI) offers a fine‑grained lens on traffic risk, yet two fundamental challenges persist: (1) how to construct street‑level indicators that capture accident‑related features, and (2) how to quantify their causal impacts across different accident types. To address these challenges, we propose Semantic4Safety, a framework that applies zero‑shot semantic segmentation to SVIs to derive 11 interpretable streetscape indicators, and integrates road type as contextual information to analyze approximately 30,000 accident records in Austin. Specifically, we train an eXtreme Gradient Boosting (XGBoost) multi‑class classifier and use Shapley Additive Explanations (SHAP) to interpret both global and local feature contributions, and then apply Generalized Propensity Score (GPS) weighting and Average Treatment Effect (ATE) estimation to control confounding and quantify causal effects. Results uncover heterogeneous, accident‑type‑specific causal patterns: features capturing scene complexity, exposure, and roadway geometry dominate predictive power; larger drivable area and emergency space reduce risk, whereas excessive visual openness can increase it. By bridging predictive modeling with causal inference, Semantic4Safety supports targeted interventions and high‑risk corridor diagnosis, offering a scalable, data‑informed tool for urban road safety planning.

Abstract:
Most existing underwater instance segmentation approaches are constrained by closed‑vocabulary prediction, limiting their ability to recognize novel marine categories. To support evaluation, we introduce MARIS (Marine Open‑Vocabulary Instance Segmentation), the first large‑scale fine‑grained benchmark for underwater Open‑Vocabulary (OV) segmentation, featuring a limited set of seen categories and diverse unseen categories. Although OV segmentation has shown promise on natural images, our analysis reveals that transfer to underwater scenes suffers from severe visual degradation (e.g., color attenuation) and semantic misalignment caused by the lack of underwater class definitions. To address these issues, we propose a unified framework with two complementary components. The Geometric Prior Enhancement Module (GPEM) leverages stable part‑level and structural cues to maintain object consistency under degraded visual conditions. The Semantic Alignment Injection Mechanism (SAIM) enriches language embeddings with domain‑specific priors, mitigating semantic ambiguity and improving recognition of unseen categories. Experiments show that our framework consistently outperforms existing OV baselines in both In‑Domain and Cross‑Domain settings on MARIS, establishing a strong foundation for future underwater perception research.

Abstract:
Semantic segmentation labels each pixel in an image with its corresponding class, and is typically evaluated using the Intersection over Union (IoU) and Dice metrics to quantify the overlap between predicted and ground‑truth segmentation masks. In the literature, most existing methods estimate pixel‑wise class probabilities, then apply argmax or thresholding to obtain the final prediction. These methods have been shown to generally lead to inconsistent or suboptimal results, as they do not directly maximize segmentation metrics. To address this issue, a novel consistent segmentation framework, RankSEG, has been proposed, which includes RankDice and RankIoU specifically designed to optimize the Dice and IoU metrics, respectively. Although RankSEG almost guarantees improved performance, it suffers from two major drawbacks. First, it is computationally expensive: RankDice has a complexity of O(d log d) with a substantial constant factor (where d represents the number of pixels), while RankIoU exhibits even higher complexity O(d^2), thus limiting its practical application. For instance, in LiTS, prediction with RankSEG takes 16.33 seconds compared to just 0.01 seconds with the argmax rule. Second, RankSEG is only applicable to overlapping segmentation settings, where multiple classes can occupy the same pixel, which contrasts with standard benchmarks that typically assume non‑overlapping segmentation. In this paper, we overcome these two drawbacks via a reciprocal moment approximation (RMA) of RankSEG with the following contributions: (i) we improve RankSEG using RMA, yielding RankSEG‑RMA, which reduces the complexity of both algorithms to O(d) while maintaining comparable performance; (ii) inspired by RMA, we develop a pixel‑wise score function that allows efficient implementation for non‑overlapping segmentation settings.
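
For reference, the conventional baseline described above (threshold the pixel‑wise probabilities, then score with Dice and IoU) can be sketched for a single binary class as follows. This is the standard thresholding rule, not RankSEG or RankSEG‑RMA; the function names, the 0.5 threshold, and the convention of returning 1.0 when both masks are empty are assumptions.

```python
def threshold_predict(probs, tau=0.5):
    """Conventional rule: predict foreground where the probability is >= tau."""
    return [1 if p >= tau else 0 for p in probs]

def dice_and_iou(pred, target):
    """Dice = 2|A∩B| / (|A| + |B|) and IoU = |A∩B| / |A∪B| for binary masks
    given as flattened lists of 0/1 values."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    n_pred = sum(pred)
    n_target = sum(target)
    union = n_pred + n_target - inter
    if union == 0:
        return 1.0, 1.0  # assumed convention: two empty masks match perfectly
    return 2.0 * inter / (n_pred + n_target), inter / union

probs = [0.9, 0.6, 0.4, 0.2]
target = [1, 0, 1, 0]
pred = threshold_predict(probs)
dice, iou = dice_and_iou(pred, target)
```

Here the fixed threshold turns the second pixel into a false positive and misses the third, so the overlap scores depend on a decision rule that never looked at Dice or IoU, which is the gap the abstract says metric‑aware prediction rules aim to close.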

Abstract:
As video transmission increasingly serves machine vision systems (MVS) instead of human vision systems (HVS), video coding for machines (VCM) has become a critical research topic. Existing VCM methods often bind codecs to specific downstream models, requiring retraining or supervised data, thus limiting generalization in multi‑task scenarios. Recently, unified VCM frameworks have employed visual backbones (VB) and visual foundation models (VFM) to support multiple video understanding tasks with a single codec. They mainly utilize VB/VFM to maintain semantic consistency or suppress non‑semantic information, but seldom explore how to directly link video coding with understanding under VB/VFM guidance. Hence, we propose a Symmetric Entropy‑Constrained Video Coding framework for Machines (SEC‑VCM). It establishes a symmetric alignment between the video codec and VB, allowing the codec to leverage VB's representation capabilities to preserve semantics and discard MVS‑irrelevant information. Specifically, a bi‑directional entropy‑constraint (BiEC) mechanism ensures symmetry between the process of video decoding and VB encoding by suppressing conditional entropy. This helps the codec to explicitly handle semantic information beneficial to MVS while squeezing useless information. Furthermore, a semantic‑pixel dual‑path fusion (SPDF) module injects pixel‑level priors into the final reconstruction. Through semantic‑pixel fusion, it suppresses artifacts harmful to MVS and improves machine‑oriented reconstruction quality. Experimental results show our framework achieves state‑of‑the‑art (SOTA) in rate‑task performance, with significant bitrate savings over VTM on video instance segmentation (37.4%), video object segmentation (29.8%), object detection (46.2%), and multiple object tracking (44.9%). We will release our code soon.

Abstract:
Vision‑language pre‑training, i.e., aligning images with paired text, is a powerful paradigm to create encoders that can be directly used for tasks such as classification, retrieval, and segmentation. In the 3D medical image domain, these capabilities allow vision‑language encoders (VLEs) to support radiologists by retrieving patients with similar abnormalities, predicting likelihoods of abnormality, or, with downstream adaptation, generating radiological reports. While the methodology holds promise, data availability and domain‑specific hurdles limit the capabilities of current 3D VLEs. In this paper, we overcome these challenges by injecting additional supervision via a report generation objective and combining vision‑language with vision‑only pre‑training. This allows us to leverage both image‑only and paired image‑text 3D datasets, increasing the total amount of data to which our model is exposed. Through these additional objectives, paired with best practices of the 3D medical imaging domain, we develop the Comprehensive Language‑Image Pre‑training (COLIPRI) encoder family. Our COLIPRI encoders achieve state‑of‑the‑art performance in report generation, semantic segmentation, classification probing, and zero‑shot classification. The model is available at https://huggingface.co/microsoft/colipri.

Abstract:
Scaling up model size and training data has advanced foundation models for instance‑level perception, achieving state‑of‑the‑art in‑domain and zero‑shot performance across object detection and segmentation. However, their high computational cost limits adoption on resource‑constrained platforms. We first examine the limitations of existing architectures in enabling efficient edge deployment without compromising performance. We then introduce MOBIUS, a family of foundation models for universal instance segmentation, designed for Pareto‑optimal downscaling to support deployment across devices ranging from high‑end accelerators to mobile hardware. To reduce training and inference demands, we propose: (i) a bottleneck pixel decoder for efficient multi‑scale and multi‑modal fusion, (ii) a language‑guided uncertainty calibration loss for adaptive decoder pruning, and (iii) a streamlined, unified training strategy. Unlike efficient baselines that trade accuracy for reduced complexity, MOBIUS reduces pixel and transformer decoder FLOPs by up to 55% and 75%, respectively, while maintaining state‑of‑the‑art performance in just a third of the training iterations. MOBIUS establishes a new benchmark for efficient segmentation on both high‑performance computing platforms and mobile devices.

Abstract:
The real world is inherently multi‑modal at its core. Our tools observe it and take snapshots of it in digital form, such as videos or sounds; however, much of it is lost. Similarly, for actions and information passing between humans, languages are used as a written form of communication. Traditionally, Machine Learning models have been unimodal (i.e., rgb -> semantic or text -> sentiment_class). Recent trends go towards bi‑modality, where images and text are learned together; however, in order to truly understand the world, we need to integrate all these independent modalities. In this work, we try to combine as many visual modalities as we can using little to no human supervision. In order to do this, we use pre‑trained experts and procedural combinations between them on top of raw videos using a fully autonomous data pipeline, which we also open‑source. We then make use of PHG‑MAE, a model specifically designed to leverage multi‑modal data. We show that this model, efficiently distilled into a low‑parameter (<1M) model, can achieve competitive results compared to models of ~300M parameters. We deploy this model and analyze the use case of real‑time semantic segmentation from handheld devices or webcams on commodity hardware. Finally, we deploy other off‑the‑shelf models using the same framework, such as DPT for near real‑time depth estimation.

Abstract:
Many network architectures exist for learning on meshes, yet their constructions entail delicate trade‑offs between difficulty learning high‑frequency features, insufficient receptive field, sensitivity to discretization, and inefficient computational overhead. Drawing from classic local‑global approaches in mesh processing, we introduce PoissonNet, a novel neural architecture that overcomes all of these deficiencies by formulating a local‑global learning scheme, which uses Poisson's equation as the primary mechanism for feature propagation. Our core network block is simple; we apply learned local feature transformations in the gradient domain of the mesh, then solve a Poisson system to propagate scalar feature updates across the surface globally. Our local‑global learning framework preserves the features' full frequency spectrum and provides a truly global receptive field, while remaining agnostic to mesh triangulation. Our construction is efficient, requiring far less compute overhead than comparable methods, which enables scalability both in the size of our datasets and in the size of individual training samples. These qualities are validated on various experiments where, compared to previous intrinsic architectures, we attain state‑of‑the‑art performance on semantic segmentation and parameterizing highly‑detailed animated surfaces. Finally, as a central application of PoissonNet, we show its ability to learn deformations, significantly outperforming state‑of‑the‑art architectures that learn on surfaces.

Abstract:
In this paper, we focus on Novel Class Discovery for Point Cloud Segmentation (3D‑NCD), aiming to learn a model that can segment unlabeled (novel) 3D classes using only the supervision from labeled (base) 3D classes. The key to this task is to set up the exact correlations between the point representations and their base class labels, as well as the representation correlations between the points from base and novel classes. Coarse or statistical correlation learning may lead to confusion in novel class inference. If we impose a causal relationship as a strongly correlated constraint upon the learning process, the essential point cloud representations that accurately correspond to the classes should be uncovered. To this end, we introduce a structural causal model (SCM) to re‑formalize the 3D‑NCD problem and propose a new method, i.e., Joint Learning of Causal Representation and Reasoning. Specifically, we first analyze hidden confounders in the base class representations and the causal relationships between the base and novel classes through SCM. We devise a causal representation prototype that eliminates confounders to capture the causal representations of base classes. A graph structure is then used to model the causal relationships between the base classes' causal representation prototypes and the novel class prototypes, enabling causal reasoning from base to novel classes. Extensive experiments and visualization results on 3D and 2D NCD semantic segmentation demonstrate the superiority of our method.

Abstract:
The development of computer vision algorithms for Unmanned Aerial Vehicle (UAV) applications in urban environments heavily relies on the availability of large‑scale datasets with accurate annotations. However, collecting and annotating real‑world UAV data is extremely challenging and costly. To address this limitation, we present FlyAwareV2, a novel multimodal dataset encompassing both real and synthetic UAV imagery tailored for urban scene understanding tasks. Building upon the recently introduced SynDrone and FlyAware datasets, FlyAwareV2 introduces several new key contributions: 1) Multimodal data (RGB, depth, semantic labels) across diverse environmental conditions including varying weather and daytime; 2) Depth maps for real samples computed via state‑of‑the‑art monocular depth estimation; 3) Benchmarks for RGB and multimodal semantic segmentation on standard architectures; 4) Studies on synthetic‑to‑real domain adaptation to assess the generalization capabilities of models trained on the synthetic data. With its rich set of annotations and environmental diversity, FlyAwareV2 provides a valuable resource for research on UAV‑based 3D urban scene understanding.

Abstract:
We propose two novel loss functions, Multiplicative Loss and Confidence‑Adaptive Multiplicative Loss, for semantic segmentation in medical and cellular images. Although Cross Entropy and Dice Loss are widely used, their additive combination is sensitive to hyperparameters and often performs suboptimally, especially with limited data. Medical images suffer from data scarcity due to privacy, ethics, and costly annotations, requiring robust and efficient training objectives. Our Multiplicative Loss combines Cross Entropy and Dice losses multiplicatively, dynamically modulating gradients based on prediction confidence. This reduces penalties for confident correct predictions and amplifies gradients for incorrect overconfident ones, stabilizing optimization. Building on this, Confidence‑Adaptive Multiplicative Loss applies a confidence‑driven exponential scaling inspired by Focal Loss, integrating predicted probabilities and Dice coefficients to emphasize difficult samples. This enhances learning under extreme data scarcity by strengthening gradients when confidence is low. Experiments on cellular and medical segmentation benchmarks show our framework consistently outperforms tuned additive and existing loss functions, offering a simple, effective, and hyperparameter‑free mechanism for robust segmentation under challenging data limitations.
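The abstract does not state the exact formula, so the following is only a minimal sketch of the general idea, assuming the multiplicative variant is the plain product of mean binary cross‑entropy and soft Dice loss. The function names and the 0.5 weight of the additive baseline are illustrative assumptions, not the authors' definitions.

```python
import math

def cross_entropy(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over pixels (probs are foreground probabilities)."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)

def dice_loss(probs, targets, eps=1e-7):
    """Soft Dice loss: 1 - 2*sum(p*t) / (sum(p) + sum(t))."""
    inter = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)

def additive_loss(probs, targets, weight=0.5):
    """Common baseline: weighted sum, with `weight` as a hyperparameter."""
    return weight * cross_entropy(probs, targets) + (1.0 - weight) * dice_loss(probs, targets)

def multiplicative_loss(probs, targets):
    """Assumed form: product of the two terms, with no weighting hyperparameter."""
    return cross_entropy(probs, targets) * dice_loss(probs, targets)

targets = [1, 1, 0, 0]
confident_correct = [0.95, 0.90, 0.05, 0.10]
confident_wrong = [0.05, 0.10, 0.95, 0.90]

for name, probs in [("correct", confident_correct), ("wrong", confident_wrong)]:
    print(name, additive_loss(probs, targets), multiplicative_loss(probs, targets))
```

On this toy input the product is smaller than the additive loss for confident correct predictions and larger for confident wrong ones, which matches the qualitative behaviour described above. The sketch compares loss values only and does not model the confidence‑adaptive exponent.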

Abstract:
3D instance segmentation is crucial for understanding complex 3D environments, yet fully supervised methods require dense point‑level annotations, resulting in substantial annotation costs and labor overhead. To mitigate this, box‑level annotations have been explored as a weaker but more scalable form of supervision. However, box annotations inherently introduce ambiguity in overlapping regions, making accurate point‑to‑instance assignment challenging. Recent methods address this ambiguity by generating pseudo‑masks through training a dedicated pseudo‑labeler in an additional training stage. However, such two‑stage pipelines often increase overall training time and complexity and hinder end‑to‑end optimization. To overcome these challenges, we propose BEEP3D: Box‑supervised End‑to‑End Pseudo‑mask generation for 3D instance segmentation. BEEP3D adopts a student‑teacher framework, where the teacher model serves as a pseudo‑labeler and is updated by the student model via an Exponential Moving Average. To better guide the teacher model to generate precise pseudo‑masks, we introduce an instance center‑based query refinement that enhances position query localization and leverages features near instance centers. Additionally, we design two novel losses, a query consistency loss and a masked feature consistency loss, to align semantic and geometric signals between predictions and pseudo‑masks. Extensive experiments on ScanNetV2 and S3DIS datasets demonstrate that BEEP3D achieves competitive or superior performance compared to state‑of‑the‑art weakly supervised methods while remaining computationally efficient.
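The teacher update via an Exponential Moving Average can be illustrated with a minimal sketch. The parameter dictionaries and the decay value are illustrative assumptions, since the abstract does not state them.

```python
def ema_update(teacher, student, decay=0.999):
    """Move each teacher weight toward the matching student weight.

    new_teacher = decay * teacher + (1 - decay) * student
    """
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }

# Toy scalar "weights"; a real model would hold tensors under each name.
teacher = {"w": 0.0, "b": 1.0}
student = {"w": 1.0, "b": 1.0}

for _ in range(3):
    teacher = ema_update(teacher, student, decay=0.9)

print(teacher["w"])  # 1 - 0.9**3, about 0.271
```

Because the teacher only changes through this average, it receives no gradient of its own, and a decay close to 1 makes its pseudo‑masks change slowly between iterations.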

Abstract:
Synthetic datasets are widely used for training urban scene recognition models, but even highly realistic renderings show a noticeable gap to real imagery. This gap is particularly pronounced when adapting to a specific target domain, such as Cityscapes, where differences in architecture, vegetation, object appearance, and camera characteristics limit downstream performance. Closing this gap with more detailed 3D modelling would require expensive asset and scene design, defeating the purpose of low‑cost labelled data. To address this, we present a new framework that adapts an off‑the‑shelf diffusion model to a target domain using only imperfect pseudo‑labels. Once trained, it generates high‑fidelity, target‑aligned images from semantic maps of any synthetic dataset, including low‑effort sources created in hours rather than months. The method filters suboptimal generations, rectifies image‑label misalignments, and standardises semantics across datasets, transforming weak synthetic data into competitive real‑domain training sets. Experiments on five synthetic datasets and two real target datasets show segmentation gains of up to +8.0 percentage points in mIoU over state‑of‑the‑art translation methods, making rapidly constructed synthetic datasets as effective as high‑effort, time‑intensive synthetic datasets requiring extensive manual design. This work highlights a valuable collaborative paradigm where fast semantic prototyping, combined with generative models, enables scalable, high‑quality training data creation for urban scene understanding.

Abstract:
We propose to build realistic virtual worlds, called 360RVW, for large urban environments directly from 360° videos. We provide an interface for interactive exploration, where users can freely navigate via their own avatars. 360° videos record the entire environment of the shooting location simultaneously leading to highly realistic and immersive representations. Our system uses 360° videos recorded along streets and builds a 360RVW through four main operations: video segmentation by intersection detection, video completion to remove the videographer, semantic segmentation for virtual collision detection with the avatar, and projection onto a distorted sphere that moves along the camera trajectory following the avatar's movements. Our interface allows users to explore large urban environments by changing their walking direction at intersections or choosing a new location by clicking on a map. Even without a 3D model, the users can experience collision with buildings using metadata produced by semantic segmentation. Furthermore, we stream the 360° videos so users can directly access 360RVW via their web browser. We fully evaluate our system, including a perceptual experiment comparing our approach to previous exploratory interfaces. The results confirm the quality of our system, especially regarding the presence of users and the interactive exploration, making it most suitable for a virtual tour of urban environments.

Abstract:
Generative Models are a valuable tool for the controlled creation of high‑quality image data. Controlled diffusion models like the ControlNet have allowed the creation of labeled distributions. Such synthetic datasets can augment the original training distribution when discriminative models, like semantic segmentation, are trained. However, this augmentation effect is limited since ControlNets tend to reproduce the original training distribution. This work introduces a method to utilize data from unlabeled domains to train ControlNets by introducing the concept of uncertainty into the control mechanism. The uncertainty indicates that a given image was not part of the training distribution of a downstream task, e.g., segmentation. Thus, two types of control are engaged in the final network: an uncertainty control from an unlabeled dataset and a semantic control from the labeled dataset. The resulting ControlNet allows us to create annotated data with high uncertainty from the target domain, i.e., synthetic data from the unlabeled distribution with labels. In our scenario, we consider retinal OCTs, where typically high‑quality Spectralis images are available with given ground truth segmentations, enabling the training of segmentation networks. The recent development in Home‑OCT devices, however, yields retinal OCTs with lower quality and a large domain shift, such that off‑the‑shelf segmentation networks cannot be applied to this type of data. Synthesizing annotated images from the Home‑OCT domain using the proposed approach closes this gap and leads to significantly improved segmentation results without adding any further supervision. The advantage of uncertainty guidance becomes obvious when compared to style transfer: it enables arbitrary domain shifts without any strict learning of an image style. This is also demonstrated in a traffic scene experiment.

Authors: Chang Liu, Henghui Ding, Kaining Ying, Lingyi Hong, Ning Xu, Linjie Yang, Yuchen Fan, Mingqi Gao, Jingkun Chen, Yunqi Miao, Gengshen Wu, Zhijin Qin, Jungong Han, Zhixiong Zhang, Shuangrui Ding, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Jiaqi Wang, Chang Soo Lim, Joonyoung Moon, Donghyeon Cho, Tingmin Li, Yixuan Li, Yang Yang, An Yan, Leilei Cao, Feng Lu, Ran Hong, Youhai Jiang, Fengjie Zhu, Yujie Xie, Hongyang Zhang, Zhihui Liu, Shihai Ruan, Quanzhu Niu, Dengxian Gong, Shihao Chen, Tao Zhang, Yikang Zhou, Haobo Yuan, Lu Qi, Xiangtai Li, Shunping Ji, Ran Hong, Feng Lu, Leilei Cao, An Yan, Alexey Nekrasov, Ali Athar, Daan de Geus, Alexander Hermans, Bastian Leibe

Abstract:
This report presents an overview of the 7th Large‑scale Video Object Segmentation (LSVOS) Challenge held in conjunction with ICCV 2025. Besides the two traditional LSVOS tracks, Classic VOS (VOS) and Referring VOS (RVOS), which jointly target robustness in realistic video scenarios, the 2025 edition features a newly introduced track, Complex VOS (MOSEv2). Building upon prior insights, MOSEv2 substantially increases difficulty, introducing more challenging but realistic scenarios including denser small objects, frequent disappear/reappear events, severe occlusions, and adverse weather and lighting, pushing long‑term consistency and generalization beyond curated benchmarks. The challenge retains the standard J, F, and J&F metrics for VOS and RVOS, while MOSEv2 adopts J&Ḟ as the primary ranking metric to better evaluate objects across scales and disappearance cases. We summarize datasets and protocols, highlight top‑performing solutions, and distill emerging trends, such as the growing role of LLM/MLLM components and memory‑aware propagation, aiming to chart future directions for resilient, language‑aware video segmentation in the wild.

Abstract:
Recently, the powerful generalization ability exhibited by foundation models has brought forth new solutions for zero‑shot anomaly segmentation tasks. However, guiding these foundation models correctly to address downstream tasks remains a challenge. This paper proposes a novel two‑stage framework for zero‑shot anomaly segmentation in industrial anomaly detection. The framework leverages the powerful anomaly localization capability of CLIP and the boundary perception ability of SAM. (1) To mitigate SAM's inclination towards object segmentation, we propose the Co‑Feature Point Prompt Generation (PPG) module. This module collaboratively utilizes CLIP and SAM to generate positive and negative point prompts, guiding SAM to focus on segmenting anomalous regions rather than the entire object. (2) To further optimize SAM's segmentation results and mitigate rough boundaries and isolated noise, we introduce the Cascaded Prompts for SAM (CPS) module. This module employs hybrid prompts cascaded with a lightweight decoder of SAM, achieving precise segmentation of anomalous regions. Consistent experimental validation across multiple datasets demonstrates that our approach achieves state‑of‑the‑art zero‑shot anomaly segmentation results. Particularly noteworthy is our performance on the VisA dataset, where we outperform the state‑of‑the‑art methods by 10.3% and 7.7% in terms of the F1‑max and AP metrics, respectively.

Abstract:
Autonomous robots often view rooms only partially, through a doorway, where the walls and scene structure hide the geometry and task‑relevant semantics needed for safe navigation and goal‑directed action. We ask whether off‑the‑shelf pretrained generative vision models can derive this missing structure as zero‑shot offline priors for robot reasoning. Such priors should support spatio‑semantic queries over unobserved structure, estimating the target object likelihood in hidden regions and the probability that those regions are occupied. Given an egocentric RGB observation and target query, our pipeline uses VLM‑guided outpainting, monocular depth estimation, and semantic segmentation to sample semantically labeled 3D point cloud hypotheses of the hidden room. We introduce MatterDoor, a Matterport3D‑derived benchmark of doorway‑occluded indoor scenes, and evaluate the resulting priors with generative metrics and simulated Stretch robot object‑reaching tasks. Our results suggest that useful spatio‑semantic priors for planning can be derived without problem‑specific fine‑tuning.

Abstract:
Real‑world point cloud datasets have made significant contributions to the development of LiDAR‑based perception technologies, such as object segmentation for autonomous driving. However, due to the limited number of instances in some rare classes, the long‑tail problem remains a major challenge in existing datasets. To address this issue, we introduce a novel, synthetic point cloud dataset named RareBoost3D, which complements existing real‑world datasets by providing significantly more instances for object classes that are rare in real‑world datasets. To effectively leverage both synthetic and real‑world data, we further propose a cross‑domain semantic alignment method named CSC loss that aligns feature representations of the same class across different domains. Experimental results demonstrate that this alignment significantly enhances the performance of LiDAR point cloud segmentation models over real‑world data.

Abstract:
Environmental perception systems are crucial for high‑precision mapping and autonomous navigation, with LiDAR serving as a core sensor providing accurate 3D point cloud data. Efficiently processing unstructured point clouds while extracting structured semantic information remains a significant challenge. In recent years, numerous pseudo‑image‑based representation methods have emerged to balance efficiency and performance by fusing 3D point clouds with 2D grids. However, the fundamental inconsistency between the pseudo‑image representation and the original 3D information critically undermines 2D‑3D feature fusion, posing a primary obstacle for coherent information fusion and leading to poor feature discriminability. This work proposes DAGLFNet, a pseudo‑image‑based semantic segmentation framework designed to extract discriminative features. It incorporates three key components: first, a Global‑Local Feature Fusion Encoding (GL‑FFE) module to enhance intra‑set local feature correlation and capture global contextual information; second, a Multi‑Branch Feature Extraction (MB‑FE) network to capture richer neighborhood information and improve the discriminability of contour features; and third, a Feature Fusion via Deep Feature‑guided Attention (FFDFA) mechanism to refine cross‑channel feature fusion precision. Experimental evaluations demonstrate that DAGLFNet achieves mean Intersection‑over‑Union (mIoU) scores of 69.9% and 78.7% on the validation sets of SemanticKITTI and nuScenes, respectively. The method achieves an excellent balance between accuracy and efficiency.
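For readers unfamiliar with the metric, mean Intersection‑over‑Union can be sketched as follows. This is a simplified per‑sample version on flat label lists; benchmark protocols such as those of SemanticKITTI and nuScenes accumulate statistics over the whole validation set and handle ignored labels, which this sketch does not model.

```python
def mean_iou(pred, target, num_classes):
    """Average the per-class IoU over the classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]

# Per-class IoU: class 0 -> 1/3, class 1 -> 2/3, class 2 -> 1/2
print(mean_iou(pred, target, num_classes=3))  # about 0.5
```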

Abstract:
This study explores the application of deep learning techniques in the automated detection and segmentation of brain tumors from MRI scans. We employ several machine learning models, including basic logistic regression, Convolutional Neural Networks (CNNs), and Residual Networks (ResNet) to classify brain tumors effectively. Additionally, we investigate the use of U‑Net for semantic segmentation and EfficientDet for anchor‑based object detection to enhance the localization and identification of tumors. Our results demonstrate promising improvements in the accuracy and efficiency of brain tumor diagnostics, underscoring the potential of deep learning in medical imaging and its significance in improving clinical outcomes.

Abstract:
Semantic segmentation is essential for automating remote sensing analysis in fields like ecology. However, fine‑grained analysis of complex aerial or underwater imagery remains an open challenge, even for state‑of‑the‑art models. Progress is frequently hindered by the high cost of obtaining the dense, expert‑annotated labels required for model supervision. While sparse point‑labels are easier to obtain, they introduce challenges regarding which points to annotate and how to propagate the sparse information. We present SSeg, a novel framework that addresses both issues. SSeg first employs an active sampling strategy to guide annotators, maximizing the value of their point labels. Then, it propagates these sparse labels with a hybrid approach leveraging the best of both SAM2 and superpixel‑based methods. Experiments on two diverse monitoring datasets demonstrate SSeg's benefits over state‑of‑the‑art approaches. Our main contribution is a simple but effective interactive annotation tool integrating our algorithms. It enables ecology researchers to leverage foundation models and computer vision to efficiently generate high‑quality segmentation masks to process their data.

Abstract:
Tracking the spatiotemporal evolution of large‑scale landslide scars is critical for understanding the evolution mechanisms and failure precursors, enabling effective early‑warning. However, most existing studies have focused on single‑phase or pre‑ and post‑failure dual‑phase landslide identification. Although these approaches delineate post‑failure landslide boundaries, it is challenging to track the spatiotemporal evolution of landslide scars. To address this problem, this study proposes a novel and universal framework for tracking the spatiotemporal evolution of large‑scale landslide scars using a vision foundation model. The key idea behind the proposed framework is to reconstruct discrete optical remote sensing images into a continuous video sequence. This transformation enables a vision foundation model, which is developed for video segmentation, to be used for tracking the evolution of landslide scars. The proposed framework operates within a knowledge‑guided, auto‑propagation, and interactive refinement paradigm to ensure the continuous and accurate identification of landslide scars. The proposed framework was validated through application to two representative cases: the post‑failure Baige landslide and the active Sela landslide (2017‑2025). Results indicate that the proposed framework enables continuous tracking of landslide scars, capturing both failure precursors critical for early warning and post‑failure evolution essential for assessing secondary hazards and long‑term stability.

Abstract:
Online sensing plays an important role in advancing modern manufacturing. The real‑time sensor signals, which can be stored as high‑resolution time series data, contain rich information about the operation status. One popular use is online process monitoring, which can be achieved by effective anomaly detection from the sensor signals. However, most existing approaches either rely heavily on labeled data for training supervised models or are designed to detect only extreme outliers, and are thus ineffective at identifying subtle semantic off‑track anomalies that mark where new regimes or unexpected routines start. To address this challenge, we propose a matrix profile‑based unsupervised anomaly detection algorithm that captures fabrication cycle similarity and performs semantic segmentation to precisely identify the onset of defect anomalies in additive manufacturing. The effectiveness of the proposed method is demonstrated by experiments on real‑world sensor data.
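As background, a matrix profile stores, for every subsequence of a time series, the distance to its nearest non‑overlapping neighbour, so a subsequence unlike any other cycle stands out with a large value. The sketch below is the naive textbook definition with z‑normalized Euclidean distance, not the authors' algorithm; the toy signal, the window length, and the exclusion‑zone width are illustrative assumptions.

```python
import math

def znorm(window):
    """Z-normalize a window; a constant window maps to all zeros."""
    mean = sum(window) / len(window)
    std = math.sqrt(sum((x - mean) ** 2 for x in window) / len(window))
    if std == 0.0:
        return [0.0] * len(window)
    return [(x - mean) / std for x in window]

def matrix_profile(series, m):
    """Naive O(n^2 * m) matrix profile with window length m."""
    n = len(series) - m + 1
    windows = [znorm(series[i:i + m]) for i in range(n)]
    exclusion = m // 2  # skip trivial matches with overlapping neighbours
    profile = []
    for i in range(n):
        best = math.inf
        for j in range(n):
            if abs(i - j) <= exclusion:
                continue
            dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(windows[i], windows[j])))
            best = min(best, dist)
        profile.append(best)
    return profile

# Three regular cycles, one off-track cycle (indices 12 to 15), three regular cycles.
cycle = [0.0, 1.0, 2.0, 1.0]
series = cycle * 3 + [0.0, 2.0, -1.0, 3.0] + cycle * 3

profile = matrix_profile(series, m=4)
discord = max(range(len(profile)), key=profile.__getitem__)
print(discord, profile[discord])
```

Windows that lie entirely inside regular cycles have an exact repeat elsewhere and therefore a profile value of zero, while the largest value (the discord) falls on a window overlapping the off‑track cycle.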

Abstract:
State‑of‑the‑art (SOTA) methods for cell instance segmentation are based on deep learning (DL) semantic segmentation approaches, focusing on distinguishing foreground pixels from background pixels. In order to identify cell instances from foreground pixels (e.g., pixel clustering), most methods decompose instance information into pixel‑wise objectives, such as distances to foreground‑background boundaries (distance maps), heat gradients with the center point as heat source (heat diffusion maps), and distances from the center point to foreground‑background boundaries with fixed angles (star‑shaped polygons). However, pixel‑wise objectives may lose significant geometric properties of the cell instances, such as shape, curvature, and convexity, which require a collection of pixels to represent. To address this challenge, we present a novel pixel clustering method, called Ceb (for Cell boundaries), to leverage cell boundary features and labels to divide foreground pixels into cell instances. Starting with probability maps generated from semantic segmentation, Ceb first extracts potential foreground‑foreground boundaries with a revised Watershed algorithm. For each boundary candidate, a boundary feature representation (called boundary signature) is constructed by sampling pixels from the current foreground‑foreground boundary as well as the neighboring background‑foreground boundaries. Next, a boundary classifier is used to predict its binary boundary label based on the corresponding boundary signature. Finally, cell instances are obtained by dividing or merging neighboring regions based on the predicted boundary labels. Extensive experiments on six datasets demonstrate that Ceb outperforms existing pixel clustering methods on semantic segmentation probability maps. Moreover, Ceb achieves highly competitive performance compared to SOTA cell instance segmentation methods.

Abstract:
Nuclei instance segmentation in pathological images is crucial for downstream tasks such as tumor microenvironment analysis. However, the high cost and scarcity of annotated data limit the applicability of fully supervised methods, while existing semi‑supervised methods fail to adequately regularize consistency at the instance level, do not leverage the inherent prior knowledge of pathological structures, and are prone to introducing noisy pseudo‑labels during training. In this paper, we propose an Instance‑Aware Robust Consistency Regularization Network (IRCR‑Net) for accurate instance‑level nuclei segmentation. Specifically, we introduce the Matching‑Driven Instance‑Aware Consistency (MIAC) and Prior‑Driven Instance‑Aware Consistency (PIAC) mechanisms to refine the nuclei instance segmentation results of the teacher and student subnetworks, particularly for densely distributed and overlapping nuclei. We incorporate morphological prior knowledge of nuclei in pathological images and utilize these priors to assess the quality of pseudo‑labels generated from unlabeled data. Low‑quality pseudo‑labels are discarded, while high‑quality predictions are enhanced to reduce pseudo‑label noise and benefit the network's robust training. Experimental results demonstrate that the proposed method significantly enhances semi‑supervised nuclei instance segmentation performance across multiple public datasets compared to existing approaches, even surpassing fully supervised methods in some scenarios.

Abstract:
Accurate perception is critical for vehicle safety, with LiDAR as a key enabler in autonomous driving. To ensure robust performance across environments, sensor types, and weather conditions without costly re‑annotation, domain generalization in LiDAR‑based 3D semantic segmentation is essential. However, LiDAR annotations are often noisy due to sensor imperfections, occlusions, and human errors. Such noise degrades segmentation accuracy and is further amplified under domain shifts, threatening system reliability. While noisy‑label learning is well‑studied in images, its extension to 3D LiDAR segmentation under domain generalization remains largely unexplored, as the sparse and irregular structure of point clouds limits direct use of 2D methods. To address this gap, we introduce the novel task Domain Generalization for LiDAR Semantic Segmentation under Noisy Labels (DGLSS‑NL) and establish the first benchmark by adapting three representative noisy‑label learning strategies from image classification to 3D segmentation. However, we find that existing noisy‑label learning approaches adapt poorly to LiDAR data. We therefore propose DuNe, a dual‑view framework with strong and weak branches that enforce feature‑level consistency and apply cross‑entropy loss based on confidence‑aware filtering of predictions. Our approach shows state‑of‑the‑art performance by achieving 56.86% mIoU on SemanticKITTI, 42.28% on nuScenes, and 52.58% on SemanticPOSS under 10% symmetric label noise, with an overall Arithmetic Mean (AM) of 49.57% and Harmonic Mean (HM) of 48.50%, thereby demonstrating robust domain generalization in DGLSS‑NL tasks. The code is available on our project page.
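The "10% symmetric label noise" setting can be illustrated with a minimal sketch. It assumes the common definition in which each label is, with probability equal to the noise rate, replaced by a different class chosen uniformly at random. Whether the benchmark corrupts labels per point or per instance, and whether the true class may be re‑drawn, is not stated in the abstract.

```python
import random

def add_symmetric_noise(labels, num_classes, noise_rate, seed=0):
    """Flip each label to a uniformly chosen *other* class with prob. noise_rate."""
    rng = random.Random(seed)  # local generator keeps the corruption reproducible
    noisy = []
    for y in labels:
        if rng.random() < noise_rate:
            others = [c for c in range(num_classes) if c != y]
            noisy.append(rng.choice(others))
        else:
            noisy.append(y)
    return noisy

labels = [i % 5 for i in range(10000)]
noisy = add_symmetric_noise(labels, num_classes=5, noise_rate=0.10)

flipped = sum(1 for a, b in zip(labels, noisy) if a != b) / len(labels)
print(flipped)  # close to the requested noise rate of 0.10
```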

Abstract:
Accurate segmentation of 3D medical images is critical for clinical applications like disease assessment and treatment planning. While the Segment Anything Model 2 (SAM2) has shown remarkable success in video object segmentation by leveraging temporal cues, its direct application to 3D medical images faces two fundamental domain gaps: 1) the bidirectional anatomical continuity between slices contrasts sharply with the unidirectional temporal flow in videos, and 2) precise boundary delineation, crucial for morphological analysis, is often underexplored in video tasks. To bridge these gaps, we propose SAM2‑3dMed, an adaptation of SAM2 for 3D medical imaging. Our framework introduces two key innovations: 1) a Slice Relative Position Prediction (SRPP) module explicitly models bidirectional inter‑slice dependencies by guiding SAM2 to predict the relative positions of different slices in a self‑supervised manner; 2) a Boundary Detection (BD) module enhances segmentation accuracy along critical organ and tissue boundaries. Extensive experiments on three diverse medical datasets (the Lung, Spleen, and Pancreas in the Medical Segmentation Decathlon (MSD) dataset) demonstrate that SAM2‑3dMed significantly outperforms state‑of‑the‑art methods, achieving superior performance in segmentation overlap and boundary precision. Our approach not only advances 3D medical image segmentation performance but also offers a general paradigm for adapting video‑centric foundation models to spatial volumetric data.

Abstract:
While Reinforcement Learning with Verifiable Rewards (RLVR) enhances complex reasoning in LLMs, current methods struggle to balance exploration and exploitation. This leads to critical issues like inaccurate credit assignment for intermediate steps and premature entropy collapse, limiting model performance. To address this, we introduce Attribution‑based Contribution to Policy Optimization (ACPO), a phased framework that incorporates a difficulty‑aware curriculum. ACPO improves exploration by using trajectory semantic segmentation and an attribution‑based representation to dynamically regulate policy entropy, thus mitigating its collapse. Concurrently, it enhances exploitation with a factorized reward system that precisely quantifies the hierarchical contribution of each reasoning step, ensuring accurate credit assignment. Extensive experiments on challenging benchmarks, including AIME, MATH, and AMC, demonstrate that ACPO significantly outperforms existing state‑of‑the‑art approaches.

Abstract:
Open‑vocabulary 3D instance segmentation seeks to segment and classify instances beyond the annotated label space. Existing methods typically map 3D instances to 2D RGB‑D images, and then employ vision‑language models (VLMs) for classification. However, such a mapping strategy usually introduces noise from 2D occlusions and incurs substantial computational and memory costs, slowing down inference. To address the above problems, we propose a Fast Open‑vocabulary 3D instance segmentation method via Label‑guided Knowledge distillation (FOLK). Our core idea is to design a teacher model that extracts high‑quality instance embeddings and distills its open‑vocabulary knowledge into a 3D student model. In this way, during inference, the distilled 3D model can directly classify instances from the 3D point cloud, avoiding noise caused by occlusions and significantly accelerating the inference process. Specifically, we first design a teacher model to generate a 2D CLIP embedding for each 3D instance, incorporating both visibility and viewpoint diversity, which serves as the learning target for distillation. We then develop a 3D student model that directly produces a 3D embedding for each 3D instance. During training, we propose a label‑guided distillation algorithm to distill open‑vocabulary knowledge from label‑consistent 2D embeddings into the student model. We conduct experiments on the ScanNet200 and Replica datasets, where FOLK achieves state‑of‑the‑art performance on ScanNet200 with an AP50 score of 35.7, while running approximately 6.0x to 152.2x faster than previous methods. All code will be released after the paper is accepted.

Abstract:
In addition to accurate scene understanding through precise semantic segmentation of LiDAR point clouds, detecting out‑of‑distribution (OOD) objects, instances not encountered during training, is essential to prevent the incorrect assignment of unknown objects to known classes. While supervised OOD detection methods depend on auxiliary OOD datasets, unsupervised methods avoid this requirement but typically rely on predictive entropy, the entropy of the predictive distribution obtained by averaging over an ensemble or multiple posterior weight samples. However, these methods often conflate epistemic (model) and aleatoric (data) uncertainties, misclassifying ambiguous in‑distribution regions as OOD. To address this issue, we present an unsupervised OOD detection approach that employs epistemic uncertainty derived from hierarchical Bayesian modeling of Gaussian Mixture Model (GMM) parameters in the feature space of a deep neural network. Without requiring auxiliary data or additional training stages, our approach outperforms existing uncertainty‑based methods on the SemanticKITTI dataset, achieving an 18% improvement in AUROC, 22% increase in AUPRC, and 36% reduction in FPR95 (from 76% to 40%), compared to the predictive entropy approach used in prior works.
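The conflation described above can be seen in a small worked example. The sketch uses the standard ensemble decomposition of predictive entropy into expected entropy (aleatoric) and mutual information (epistemic). It is not the hierarchical Bayesian GMM approach of the paper, and the two toy ensembles are made‑up inputs.

```python
import math

def entropy(p):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def decompose(member_probs):
    """Return (total, aleatoric, epistemic) for one point.

    total     = entropy of the averaged prediction (predictive entropy)
    aleatoric = average entropy of the individual members
    epistemic = total - aleatoric (mutual information)
    """
    k = len(member_probs)
    num_classes = len(member_probs[0])
    mean = [sum(m[c] for m in member_probs) / k for c in range(num_classes)]
    total = entropy(mean)
    aleatoric = sum(entropy(m) for m in member_probs) / k
    return total, aleatoric, total - aleatoric

# Members agree that the point is genuinely ambiguous (in-distribution).
ambiguous = [[0.5, 0.5], [0.5, 0.5]]
# Members are individually confident but contradict each other (OOD-like).
disagree = [[0.99, 0.01], [0.01, 0.99]]

print(decompose(ambiguous))
print(decompose(disagree))
```

Both cases have the same predictive entropy (about ln 2), so a threshold on it cannot separate them, whereas the epistemic term is near zero for the ambiguous point and large for the disagreeing ensemble.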

Abstract:
Affordance segmentation aims to decompose 3D objects into parts that serve distinct functional roles, enabling models to reason about object interactions rather than mere recognition. Existing methods, mostly following the paradigm of 3D semantic segmentation or prompt‑based frameworks, struggle when geometric cues are weak or ambiguous, as sparse point clouds provide limited functional information. To overcome this limitation, we leverage the rich semantic knowledge embedded in large‑scale 2D Vision Foundation Models (VFMs) to guide 3D representation learning through a cross‑modal alignment mechanism. Specifically, we propose Cross‑Modal Affinity Transfer (CMAT), a pretraining strategy that compels the 3D encoder to align with the semantic structures induced by lifted 2D features. CMAT is driven by a core affinity alignment objective, supported by two auxiliary losses, geometric reconstruction and feature diversity, which together encourage structured and discriminative feature learning. Built upon the CMAT‑pretrained backbone, we employ a lightweight affordance segmentor that injects text or visual prompts into the learned 3D space through an efficient cross‑attention interface, enabling dense and prompt‑aware affordance prediction while preserving the semantic organization established during pretraining. Extensive experiments demonstrate consistent improvements over previous state‑of‑the‑art methods in both accuracy and efficiency.

Abstract:
Referring Video Object Segmentation (RVOS) aims to segment objects in videos given linguistic expressions. The key to solving RVOS is to extract long‑range temporal context information from the interactions of expressions and videos to depict the dynamic attributes of each object. Previous works either adopt attention across all the frames or stack dense local attention to achieve a global view of temporal context. However, they fail to strike a good balance between locality and globality, and the computational complexity grows significantly with video length. In this paper, we propose an effective long‑range temporal context attention (LTCA) mechanism to aggregate global context information into object features. Specifically, we aggregate the global context information from two aspects. First, we stack sparse local attentions to balance locality and globality: we design a dilated window attention across frames to aggregate local context information and perform such attention in a stack of layers to enable a global view. Further, we enable each query to attend to a small group of keys randomly selected from a global pool to enhance globality. Second, we design a global query that interacts with all the other queries to directly encode the global context information. Experiments show that our method achieves new state‑of‑the‑art results on four referring video segmentation benchmarks. Notably, our method shows an improvement of 11.3% and 8.3% on the MeViS val_u and val datasets, respectively.

Abstract:
Referring Video Object Segmentation (RVOS) aims to segment the object referred to by the query sentence in the video. Most existing methods require end‑to‑end training with dense mask annotations, which can be computationally expensive and less scalable. In this work, we rethink the RVOS problem and aim to investigate the key to this task. Based on existing foundation segmentation models, we decompose the RVOS task into referring, video, and segmentation factors, and propose a Temporal Prompt Generation and Selection (Tenet) framework to address the referring and video factors while leaving the segmentation problem to foundation models. To efficiently adapt image‑based foundation segmentation models to referring video object segmentation, we leverage off‑the‑shelf object detectors and trackers to produce temporal prompts associated with the referring sentence. While high‑quality temporal prompts can be produced, they cannot be easily identified from confidence scores. To tackle this issue, we propose Prompt Preference Learning to evaluate the quality of the produced temporal prompts. By using such prompts to instruct image‑based foundation segmentation models, we are able to produce high‑quality masks for the referred object, enabling efficient model adaptation to referring video object segmentation. Experiments on RVOS benchmarks demonstrate the effectiveness of the Tenet framework.

Abstract:
Persistent dynamic scene modeling for tracking and novel‑view synthesis remains challenging due to the difficulty of capturing accurate deformations while maintaining computational efficiency. We propose SCas4D, a cascaded optimization framework that leverages structural patterns in 3D Gaussian Splatting for dynamic scenes. The key idea is that real‑world deformations often exhibit hierarchical patterns, where groups of Gaussians share similar transformations. By progressively refining deformations from coarse part‑level to fine point‑level, SCas4D achieves convergence within 100 iterations per time frame and produces results comparable to existing methods with only one‑twentieth of the training iterations. The approach also demonstrates effectiveness in self‑supervised articulated object segmentation, novel view synthesis, and dense point tracking tasks.

Abstract:
Semantic segmentation serves as a cornerstone of scene understanding in autonomous driving but continues to face significant challenges under complex conditions such as occlusion. Light field and LiDAR modalities provide complementary visual and spatial cues that are beneficial for robust perception; however, their effective integration is hindered by limited viewpoint diversity and inherent modality discrepancies. To address these challenges, we propose the first multimodal semantic segmentation dataset integrating light field data and point cloud data. Based on this dataset, we propose a multi‑modal light field point‑cloud fusion segmentation network (Mlpfseg), incorporating feature completion and depth perception to segment both camera images and LiDAR point clouds simultaneously. The feature completion module addresses the density mismatch between point clouds and image pixels by performing differential reconstruction of point‑cloud feature maps, enhancing the fusion of these modalities. The depth perception module improves the segmentation of occluded objects by reinforcing attention scores for better occlusion awareness. Our method outperforms image‑only segmentation by 1.71 Mean Intersection over Union (mIoU) and point cloud‑only segmentation by 2.38 mIoU, demonstrating its effectiveness.

Abstract:
We present DropD‑SLAM, a real‑time monocular SLAM system that achieves RGB‑D‑level accuracy without relying on depth sensors. The system replaces active depth input with three pretrained vision modules: a monocular metric depth estimator, a learned keypoint detector, and an instance segmentation network. Dynamic objects are suppressed using dilated instance masks, while static keypoints are assigned predicted depth values and backprojected into 3D to form metrically scaled features. These are processed by an unmodified RGB‑D SLAM back end for tracking and mapping. On the TUM RGB‑D benchmark, DropD‑SLAM attains 7.4 cm mean ATE on static sequences and 1.8 cm on dynamic sequences, matching or surpassing state‑of‑the‑art RGB‑D methods while operating at 22 FPS on a single GPU. These results suggest that modern pretrained vision models can replace active depth sensors as reliable, real‑time sources of metric scale, marking a step toward simpler and more cost‑effective SLAM systems.
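
The front‑end step described above (suppress keypoints on dilated instance masks, then backproject the remaining keypoints with predicted depth) can be sketched as follows. This is a minimal illustration under a pinhole camera assumption; the function names, the square dilation window, and the toy intrinsics are hypothetical and are not taken from DropD‑SLAM itself.

```python
def dilate(mask, radius=1):
    """Binary dilation of a 2D boolean mask with a square window."""
    h, w = len(mask), len(mask[0])
    out = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                for dy in range(-radius, radius + 1):
                    for dx in range(-radius, radius + 1):
                        yy, xx = y + dy, x + dx
                        if 0 <= yy < h and 0 <= xx < w:
                            out[yy][xx] = True
    return out

def backproject(u, v, z, fx, fy, cx, cy):
    """Pinhole backprojection of pixel (u, v) with metric depth z."""
    return ((u - cx) * z / fx, (v - cy) * z / fy, z)

def static_points(keypoints, depth, dynamic_mask, intrinsics, radius=1):
    """Keep keypoints outside the dilated dynamic mask and lift them to 3D."""
    blocked = dilate(dynamic_mask, radius)
    fx, fy, cx, cy = intrinsics
    return [backproject(u, v, depth[v][u], fx, fy, cx, cy)
            for (u, v) in keypoints if not blocked[v][u]]

# Toy 4x4 frame: one dynamic pixel at row 1, column 1; constant 2 m depth.
dynamic = [[False] * 4 for _ in range(4)]
dynamic[1][1] = True
depth = [[2.0] * 4 for _ in range(4)]
keypoints = [(1, 1), (2, 2), (3, 3)]      # (u, v) = (column, row)
intrinsics = (2.0, 2.0, 1.0, 1.0)         # fx, fy, cx, cy (made-up values)

points = static_points(keypoints, depth, dynamic, intrinsics)
```

With a dilation radius of 1, the keypoints at (1, 1) and (2, 2) fall inside the dilated mask and are dropped; only (3, 3) is lifted to a 3D point that a downstream RGB‑D back end could consume.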

Authors: J. Schueler, H. M. Araújo, S. N. Balashov, J. E. Borg, C. Brew, F. M. Brunbauer, C. Cazzaniga, A. Cottle, D. Edgeman, C. D. Frost, F. Garcia, D. Hunt, M. Kastriotou, P. Knights, H. Kraus, A. Lindote, M. Lisowska, D. Loomba, E. Lopez Asamar, P. A. Majewski, T. Marley, C. McCabe, L. Millins, R. Nandakumar, T. Neep, F. Neves, K. Nikolopoulos, E. Oliveri, A. Roy, T. J. Sumner, E. Tilly, W. Thompson, M. A. Vogiatzi

Abstract:
The separation of overlapping objects presents a significant challenge in scientific imaging. While deep learning segmentation‑regression algorithms can predict pixel‑wise intensities, they typically treat all regions equally rather than prioritizing overlap regions where attribution is most ambiguous. Recent advances in instance segmentation show that weighting regions of pixel overlap in training can improve segmentation boundary predictions in regions of overlap, but this idea has not yet been extended to segmentation regression. We address this with Overlap‑Aware Segmentation of ImageS (OASIS): a new segmentation‑regression framework with a weighted loss function designed to prioritize regions of object‑overlap during training, enabling extraction of pixel intensities and topological features from heavily obscured objects. We demonstrate OASIS in the context of the MIGDAL experiment, which aims to directly image the Migdal effect‑‑a rare process where electron emission is induced by nuclear scattering‑‑in a low‑pressure optical time projection chamber. This setting poses an extreme test case, as the target for reconstruction is a faint electron recoil track which is often heavily‑buried within the order(s)‑of‑magnitude brighter nuclear recoil track. Compared to unweighted segmentation regression, we demonstrate OASIS's novel overlap region‑targeted loss function weight to be the single most important training weight for improving intensity and topological reconstructions of the low‑energy electron tracks that tend to be most dominated by pixel overlap. Averaging over eight training campaigns, we further show the addition of overlap‑targeted weights to improve median intensity reconstruction errors from ‑41.1% to ‑13.3% for these low‑energy electrons. These performance gains demonstrate OASIS as a generalizable methodology for recovering obscured signals in overlap‑dominated regions.

Abstract:
Referring Video Object Segmentation (RVOS) requires segmenting specific objects in a video guided by a natural language description. The core challenge of RVOS is to anchor abstract linguistic concepts onto a specific set of pixels and continuously segment them through the complex dynamics of a video. Faced with this difficulty, prior work has often decomposed the task into a pragmatic 'locate‑then‑segment' pipeline. However, this cascaded design creates an information bottleneck by simplifying semantics into coarse geometric prompts (e.g., points), and struggles to maintain temporal consistency as the segmentation process is often decoupled from the initial language grounding. To overcome these fundamental limitations, we propose FlowRVS, a novel framework that reconceptualizes RVOS as a conditional continuous flow problem. This allows us to harness the inherent strengths of pretrained T2V models: fine‑grained pixel control, text‑video semantic alignment, and temporal coherence. Instead of the conventional approaches of generating a mask from noise or directly predicting the mask, we reformulate the task by learning a direct, language‑guided deformation from a video's holistic representation to its target mask. Our one‑stage, generative approach achieves new state‑of‑the‑art results across all major RVOS benchmarks. Specifically, it achieves a J&F of 51.1 on MeViS (+1.6 over prior SOTA) and 73.3 on zero‑shot Ref‑DAVIS17 (+2.7), demonstrating the significant potential of modeling video understanding tasks as continuous deformation processes.

Abstract:
The manual annotation of outdoor LiDAR point clouds for instance segmentation is extremely costly and time‑consuming. Current methods attempt to reduce this burden but still rely on some form of human labeling. To completely eliminate this dependency, we introduce ALISE, a novel framework that performs LiDAR instance segmentation without any annotations. The central challenge is to generate high‑quality pseudo‑labels in a fully unsupervised manner. Our approach starts by employing Vision Foundation Models (VFMs), guided by text and images, to produce initial pseudo‑labels. We then refine these labels through a dedicated spatio‑temporal voting module, which combines 2D and 3D semantics for both offline and online optimization. To achieve superior feature learning, we further introduce two forms of semantic supervision: a set of 2D prior‑based losses that inject visual knowledge into the 3D network, and a novel prototype‑based contrastive loss that builds a discriminative feature space by exploiting 3D semantic consistency. This comprehensive design results in significant performance gains, establishing a new state‑of‑the‑art for unsupervised 3D instance segmentation. Remarkably, our approach even outperforms MWSIS, a method that operates with supervision from ground‑truth (GT) 2D bounding boxes, by a margin of 2.53% in mAP (50.95% vs. 48.42%).

Abstract:
Generating sufficient and diverse data through augmentation offers an efficient alternative to the time‑consuming and labour‑intensive process of collecting images and annotating them pixel‑wise. Traditional data augmentation techniques often face challenges in manipulating high‑level semantic attributes, such as materials and textures. In contrast, diffusion models offer a robust alternative by effectively utilizing text‑to‑image or image‑to‑image transformation. However, existing diffusion‑based methods are either computationally expensive or compromise on performance. To address this issue, we introduce a novel training‑free pipeline that integrates pretrained ControlNet and Vision‑Language Models (VLMs) to generate synthetic images paired with pixel‑level labels. This approach eliminates the need for manual annotations and significantly improves downstream tasks. To improve fidelity and diversity, we add a Multi‑way Prompt Generator, a Mask Generator, and a High‑quality Image Selection module. Our results on PASCAL‑5i and COCO‑20i show promising performance and outperform concurrent work for one‑shot semantic segmentation.

Abstract:
Object recognition and motion understanding are key components of perception that complement each other. While self‑supervised learning methods have shown promise in their ability to learn from unlabeled data, they have primarily focused on obtaining rich representations for either recognition or motion rather than both in tandem. On the other hand, latent dynamics modeling has been used in decision making to learn latent representations of observations and their transformations over time for control and planning tasks. In this work, we present Midway Network, a new self‑supervised learning architecture that is the first to learn strong visual representations for both object recognition and motion understanding solely from natural videos, by extending latent dynamics modeling to this domain. Midway Network leverages a midway top‑down path to infer motion latents between video frames, as well as a dense forward prediction objective and hierarchical structure to tackle the complex, multi‑object scenes of natural videos. We demonstrate that after pretraining on two large‑scale natural video datasets, Midway Network achieves strong performance on both semantic segmentation and optical flow tasks relative to prior self‑supervised learning methods. We also show that Midway Network's learned dynamics can capture high‑level correspondence via a novel analysis method based on forward feature perturbation.

Abstract:
Few‑shot semantic segmentation is vital for deep learning‑based infrastructure inspection applications, where labeled training examples are scarce and expensive. Although existing deep learning frameworks perform well, their need for extensive labeled datasets and their inability to learn new defect categories from little data are problematic. We present our Enhanced Feature Pyramid Network (E‑FPN) framework for few‑shot semantic segmentation of culvert and sewer defect categories using a prototypical learning framework. Our approach has three main contributions: (1) an adaptive E‑FPN encoder using InceptionSepConv blocks and depth‑wise separable convolutions for efficient multi‑scale feature extraction; (2) prototypical learning with masked average pooling for powerful prototype generation from small support examples; and (3) attention‑based feature representation through global self‑attention, local self‑attention, and cross‑attention. Comprehensive experimentation on challenging infrastructure inspection datasets illustrates that the method achieves excellent few‑shot performance, with the best configuration, 8‑way 5‑shot training, reaching 82.55% F1‑score and 72.26% mIoU in 2‑way classification testing. The self‑attention method yielded the most significant performance improvements, providing a 2.57% F1‑score and 2.9% mIoU gain over baselines. Our framework addresses the critical need to rapidly respond to new defect types in infrastructure inspection systems with limited new training data, leading to more efficient and economical maintenance plans for critical infrastructure systems.
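
Contribution (2) above, prototype generation by masked average pooling, can be sketched in a few lines. This is a generic illustration of the technique rather than the E‑FPN implementation; the cosine‑similarity matching rule, the class names, and the toy feature values are assumptions.

```python
import math

def masked_average_pooling(features, mask):
    """Average the feature vectors at positions where mask is 1.

    features[y][x] is a C-dimensional list; mask[y][x] is 0 or 1.
    """
    channels = len(features[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, m in zip(feat_row, mask_row):
            if m:
                count += 1
                for i in range(channels):
                    total[i] += vec[i]
    if count == 0:
        raise ValueError("mask selects no pixels")
    return [t / count for t in total]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def classify(vec, prototypes):
    """Assign a query feature to the most similar class prototype."""
    return max(prototypes, key=lambda name: cosine(vec, prototypes[name]))

# Toy 2x2 support feature map with 2 channels (values are made up).
support = [[[1.0, 0.0], [1.0, 0.0]],
           [[0.0, 1.0], [0.0, 2.0]]]
prototypes = {
    "crack": masked_average_pooling(support, [[1, 1], [0, 0]]),
    "background": masked_average_pooling(support, [[0, 0], [1, 1]]),
}
label = classify([0.9, 0.1], prototypes)
```

Each class prototype is simply the mean support feature under that class's mask, so adding a new defect category only requires a few masked support examples rather than retraining a classifier head.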

Abstract:
Infrared Small Target Detection (IRSTD) is a challenging task in defense applications, where complex backgrounds and tiny target sizes often result in numerous false alarms using conventional object detectors. To overcome this limitation, we propose Anomaly‑Aware YOLO (AA‑YOLO), which integrates a statistical anomaly detection test into its detection head. By treating small targets as unexpected patterns against the background, AA‑YOLO effectively controls the false alarm rate. Our approach not only achieves competitive performance on several IRSTD benchmarks, but also demonstrates remarkable robustness in scenarios with limited training data, noise, and domain shifts. Furthermore, since only the detection head is modified, our design is highly generic and has been successfully applied across various YOLO backbones, including lightweight models. It also provides promising results when integrated into an instance segmentation YOLO. This versatility makes AA‑YOLO an attractive solution for real‑world deployments where resources are constrained. The code will be publicly released.

Abstract:
We propose a simple, data‑efficient pipeline that augments an implicit reconstruction network based on neural SDF‑based CAD parts with a part‑segmentation head trained under PartField‑generated supervision. Unlike methods tied to fixed taxonomies, our model accepts meshes with any number of parts and produces coherent, geometry‑aligned labels in a single pass. We evaluate on randomly sampled CAD meshes from the ABC dataset with intentionally varied part cardinalities, including over‑segmented shapes, and report strong performance across reconstruction (CDL1/CDL2, F1‑micro, NC) and segmentation (mIoU, Accuracy), together with a new Segmentation Consistency metric that captures local label smoothness. We attach a lightweight segmentation head to the Flat‑CAD SDF trunk; on a paired evaluation it does not alter reconstruction while providing accurate part labels for meshes with any number of parts. Even under degraded reconstructions on thin or intricate geometries, segmentation remains accurate and label‑coherent, often preserving the correct part count. Our approach therefore offers a practical route to semantically structured CAD meshes without requiring curated taxonomies or exact palette matches. We discuss limitations in boundary precision, partly due to per‑face supervision, and outline paths toward boundary‑aware training and higher resolution labels.

Abstract:
Robots with internal visual self‑models promise unprecedented adaptability, yet existing autonomous modeling pipelines remain fragile under realistic sensing conditions such as noisy imagery and cluttered backgrounds. This paper presents the first systematic study quantifying how visual degradations‑‑including blur, salt‑and‑pepper noise, and Gaussian noise‑‑affect robotic self‑modeling. Through both simulation and physical experiments, we demonstrate their impact on morphology prediction, trajectory planning, and damage recovery in state‑of‑the‑art pipelines. To overcome these challenges, we introduce a task‑aware denoising framework that couples classical restoration with morphology‑preserving constraints, ensuring retention of structural cues critical for self‑modeling. In addition, we integrate semantic segmentation to robustly isolate robots from cluttered and colorful scenes. Extensive experiments show that our approach restores near‑baseline performance across simulated and physical platforms, while existing pipelines degrade significantly. These contributions advance the robustness of visual self‑modeling and establish practical foundations for deploying self‑aware robots in unpredictable real‑world environments.

Abstract:
The generalization of deep neural networks to unknown domains remains a major challenge despite their tremendous progress in recent years. For this reason, the dynamic area of domain generalization (DG) has emerged. In contrast to unsupervised domain adaptation, there is no access to or knowledge about the target domains, and DG methods aim to generalize across multiple different unseen target domains. Domain generalization is particularly relevant for the task of semantic segmentation, which is used in several areas such as biomedicine or automated driving. This survey provides a comprehensive overview of the rapidly evolving topic of domain generalized semantic segmentation. We cluster and review existing approaches and identify the paradigm shift towards foundation‑model‑based domain generalization. Finally, we provide an extensive performance comparison of all approaches, which highlights the significant influence of foundation models on domain generalization. This survey seeks to advance domain generalization research and inspire scientists to explore new research directions.

Abstract:
Supervised deep learning for land cover semantic segmentation (LCS) relies on labeled satellite data. However, most existing Sentinel‑2 datasets are cloud‑free, which limits their usefulness in tropical regions where clouds are common. To properly evaluate the extent of this problem, we developed a cloud injection algorithm that simulates realistic cloud cover, allowing us to test how Sentinel‑1 radar data can fill in the gaps caused by cloud‑obstructed optical imagery. We also tackle the issue of losing spatial and/or spectral details during encoder downsampling in deep networks. To mitigate this loss, we propose a lightweight method that injects Normalized Difference Indices (NDIs) into the final decoding layers, enabling the model to retain key spatial features with minimal additional computation. Injecting NDIs enhanced land cover segmentation performance on the DFC2020 dataset, yielding improvements of 1.99% for U‑Net and 2.78% for DeepLabV3 on cloud‑free imagery. Under cloud‑covered conditions, incorporating Sentinel‑1 data led to significant performance gains across all models compared to using optical data alone, highlighting the effectiveness of radar‑optical fusion in challenging atmospheric scenarios.
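
A normalized difference index has the general form (a - b) / (a + b) for two spectral bands a and b. The sketch below computes one such index and appends it to a list of decoder feature maps by channel concatenation. NDVI (near‑infrared vs. red) is used only as a familiar example; which indices the paper injects, and how exactly they are fused in the final decoding layers, are not stated in the abstract, so the helper names and the concatenation step are assumptions.

```python
def normalized_difference(band_a, band_b, eps=1e-6):
    """Per-pixel (a - b) / (a + b); eps guards against division by zero."""
    return [[(a - b) / (a + b + eps) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(band_a, band_b)]

def inject_channels(decoder_features, index_maps):
    """Append index maps as extra channels (channel-first list of H x W maps)."""
    return list(decoder_features) + list(index_maps)

# Toy 1x2 reflectance patches (made-up values).
nir = [[0.6, 0.2]]
red = [[0.2, 0.2]]
ndvi = normalized_difference(nir, red)

# Two hypothetical decoder channels at the same spatial resolution.
decoder_features = [[[0.1, 0.3]], [[0.7, 0.5]]]
fused = inject_channels(decoder_features, [ndvi])
```

Because the index is computed directly from the input bands at full resolution, feeding it to the last decoder layers bypasses the encoder's downsampling, which is the loss of detail the abstract sets out to mitigate.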

Abstract:
Detecting unknown objects in semantic segmentation is crucial for safety‑critical applications such as autonomous driving. Large vision foundation models, including DINOv2, InternImage, and CLIP, have advanced visual representation learning by providing rich features that generalize well across diverse tasks. While their strength in closed‑set semantic tasks is established, their capability to detect out‑of‑distribution (OoD) regions in semantic segmentation remains underexplored. In this work, we investigate whether foundation models fine‑tuned on segmentation datasets can inherently distinguish in‑distribution (ID) from OoD regions without any outlier supervision. We propose a simple, training‑free approach that utilizes features from the InternImage backbone and applies K‑Means clustering alongside confidence thresholding on raw decoder logits to identify OoD clusters. With InternImage‑L, our method achieves 50.02 Average Precision on the RoadAnomaly benchmark and 48.77 on the ADE‑OoD benchmark, surpassing several supervised and unsupervised baselines. These results suggest a promising direction for generic OoD segmentation methods that require minimal assumptions or additional data.
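
The combination of clustering and confidence thresholding described above might look roughly like the following. This is a hedged sketch only: the abstract does not specify the decision rule, so flagging a cluster when its mean maximum logit falls below a threshold is an assumption, as are the function names, the fixed initial centroids, and the toy data.

```python
def kmeans(points, init_centroids, iters=10):
    """Plain Lloyd's algorithm with caller-supplied initial centroids."""
    centroids = [list(c) for c in init_centroids]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, p in enumerate(points):
            assign[i] = min(
                range(len(centroids)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centroids[k])),
            )
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assign) if a == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def flag_ood_clusters(assign, logits, threshold):
    """Return cluster ids whose mean max-logit confidence is below threshold."""
    per_cluster = {}
    for cluster_id, pixel_logits in zip(assign, logits):
        per_cluster.setdefault(cluster_id, []).append(max(pixel_logits))
    return {k for k, v in per_cluster.items() if sum(v) / len(v) < threshold}

# Toy per-pixel backbone features and raw decoder logits (made-up values).
features = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
logits = [[8.0, 1.0], [7.5, 0.5], [1.2, 1.0], [0.9, 1.1]]

assignment = kmeans(features, [features[0], features[2]])
ood_clusters = flag_ood_clusters(assignment, logits, threshold=3.0)
```

Deciding at the cluster level rather than per pixel means that a few low‑confidence pixels inside an otherwise confident region are not flagged on their own, while a whole region of weakly supported predictions is.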

Abstract:
Simultaneous Localization and Mapping (SLAM) plays an important role in many robotics fields, including social robots. Many of the available visual SLAM methods are based on the assumption of a static world and struggle in dynamic environments. In the current study, we introduce a real‑time semantic RGBD SLAM approach designed specifically for dynamic environments. Our proposed system can effectively detect moving objects and maintain a static map to ensure robust camera tracking. The key innovation of our approach is the incorporation of deep learning‑based semantic information into SLAM systems to mitigate the impact of dynamic objects. Additionally, we enhance the semantic segmentation process by integrating an Extended Kalman filter to identify dynamic objects that may be temporarily idle. We have also implemented a generative network to fill in the missing regions of input images belonging to dynamic objects. This highly modular framework has been implemented on the ROS platform and can achieve around 22 fps on a GTX1080. Benchmarking the developed pipeline on dynamic sequences from the TUM dataset suggests that the proposed approach delivers competitive localization error in comparison with the state‑of‑the‑art methods, all while operating in near real‑time. The source code is publicly available.

Abstract:
The integration of electric vehicles (EVs) into smart grids presents unique opportunities to enhance both transportation systems and energy networks. However, ensuring safe and interpretable interactions between drivers, vehicles, and the surrounding environment remains a critical challenge. This paper presents a multi‑modal large language model (LLM)‑based framework to process multimodal sensor data ‑ such as object detection, semantic segmentation, and vehicular telemetry ‑ and generate natural‑language alerts for drivers. The framework is validated using real‑world data collected from instrumented vehicles driving on urban roads, ensuring its applicability to real‑world scenarios. By combining visual perception (YOLOv8), geocoded positioning, and CAN bus telemetry, the framework bridges raw sensor data and driver comprehension, enabling safer and more informed decision‑making in urban driving scenarios. Case studies using real data demonstrate the framework's effectiveness in generating context‑aware alerts for critical situations, such as proximity to pedestrians, cyclists, and other vehicles. This paper highlights the potential of LLMs as assistive tools in e‑mobility, benefiting both transportation systems and electric networks by enabling scalable fleet coordination, EV load forecasting, and traffic‑aware energy planning. Index Terms ‑ Electric vehicles, visual perception, large language models, YOLOv8, semantic segmentation, CAN bus, prompt engineering, smart grid.

Abstract:
Federated Learning (FL) offers a privacy‑preserving solution for Semantic Segmentation (SS) tasks to adapt to new domains, but faces significant challenges from domain shifts, particularly when client data is unlabeled. Moreover, most existing FL methods unrealistically assume access to labeled data on remote clients or fail to leverage the power of modern Vision Foundation Models (VFMs). Here, we propose a novel and challenging task, FFREEDG, in which a model is pretrained on a server's labeled source dataset and subsequently trained across clients using only their unlabeled data, without ever re‑accessing the source. To solve FFREEDG, we propose FRIEREN, a framework that leverages the knowledge of a VFM by integrating vision and language modalities. Our approach employs a Vision‑Language decoder guided by CLIP‑based text embeddings to improve semantic disambiguation and uses a weak‑to‑strong consistency learning strategy for robust local training on pseudo‑labels. Our experiments on synthetic‑to‑real and clear‑to‑adverse‑weather benchmarks demonstrate that our framework effectively tackles this new task, achieving competitive performance against established domain generalization and adaptation methods and setting a strong baseline for future research.

Abstract:
Video object segmentation (VOS) models such as SAM2 offer promising zero‑shot tracking capabilities for surgical videos using minimal user input. Among the available input types, point‑based tracking offers an efficient and low‑cost alternative, yet its reliability and failure cases in complex surgical environments are not well understood. In this work, we systematically analyze the failure modes of point‑based tracking in laparoscopic cholecystectomy videos. Focusing on three surgical targets, the gallbladder, grasper, and L‑hook electrocautery, we compare the performance of point‑based tracking with segmentation mask initialization. Our results show that point‑based tracking is competitive for surgical tools but consistently underperforms for anatomical targets, where tissue similarity and ambiguous boundaries lead to failure. Through qualitative analysis, we reveal key factors influencing tracking outcomes and provide several actionable recommendations for selecting and placing tracking points to improve performance in surgical video analysis.

Abstract:
Vision Transformers can achieve high accuracy and strong generalization across various contexts, but their practical applicability on real‑world robotic systems is limited due to their quadratic attention complexity. Recent works have focused on dynamically merging tokens according to the image complexity. Token merging works well for classification but is less suited to dense prediction. We propose ClustViT, where we expand upon the Vision Transformer (ViT) backbone and address semantic segmentation. Within our architecture, a trainable Cluster module merges similar tokens along the network guided by pseudo‑clusters from segmentation masks. Subsequently, a Regenerator module restores fine details for downstream heads. Our approach achieves up to 2.18x fewer GFLOPs and 1.64x faster inference on three different datasets, with comparable segmentation accuracy. Our code and models will be made publicly available.

Abstract:
Worldwide visual geo‑localization aims to determine the geographic location of an image anywhere on Earth using only its visual content. Despite recent progress, learning expressive representations of geographic space remains challenging due to the inherently low‑dimensional nature of geographic coordinates. We formulate global geo‑localization as aligning the visual representation of a query image with a learned geographic representation. Our approach explicitly models the world as a hierarchy of learned geographic embeddings, enabling a distributed and multi‑scale representation of geographic space. In addition, we introduce a semantic fusion module that efficiently integrates appearance features with semantic segmentation through latent cross‑attention, producing a more robust visual representation for localization. Experiments on five widely used geo‑localization benchmarks demonstrate that our method achieves new state‑of‑the‑art results on 22 of 25 reported metrics. Ablation studies show that these improvements are primarily driven by the proposed geographic representation and semantic fusion mechanism.

Abstract:
In recent years, 3D scene graphs have emerged as a powerful world representation, offering both geometric accuracy and semantic richness. Combining 3D scene graphs with large language models enables robots to reason, plan, and navigate in complex human‑centered environments. However, current approaches for constructing 3D scene graphs are semantically limited to a predefined set of relationships, and their serialization in large environments can easily exceed an LLM's context window. We introduce KeySG, a framework that represents 3D scenes as a hierarchical graph consisting of floors, rooms, objects, and functional elements, where nodes are augmented with multi‑modal information extracted from keyframes selected to optimize geometric and visual coverage. The keyframes allow us to efficiently leverage VLMs to extract scene information, alleviating the need to explicitly model relationship edges between objects, enabling more general, task‑agnostic reasoning and planning. Our approach can process complex and ambiguous queries while mitigating the scalability issues associated with large scene graphs by utilizing a hierarchical multi‑modal retrieval‑augmented generation (RAG) pipeline to extract relevant context from the graph. Evaluated across three distinct benchmarks, 3D object semantic segmentation, functional element segmentation, and complex query retrieval, KeySG outperforms prior approaches on most metrics, demonstrating its superior semantic richness and efficiency.

Abstract:
Understanding how natural language phrases correspond to specific regions in images is a key challenge in multimodal semantic segmentation. Recent advances in phrase grounding are largely limited to single‑view images, neglecting the rich geometric cues available in stereo vision. To address this gap, we introduce PhraseStereo, the first dataset that brings phrase‑region segmentation to stereo image pairs. PhraseStereo builds upon the PhraseCut dataset by leveraging GenStereo to generate accurate right‑view images from existing single‑view data, enabling the extension of phrase grounding into the stereo domain. This new setting introduces unique challenges and opportunities for multimodal learning, particularly in leveraging depth cues for more precise and context‑aware grounding. By providing stereo image pairs with aligned segmentation masks and phrase annotations, PhraseStereo lays the foundation for future research at the intersection of language, vision, and 3D perception, encouraging the development of models that can reason jointly over semantics and geometry. The PhraseStereo dataset will be released online upon acceptance of this work.

Abstract:
Building facades represent a significant untapped resource for solar energy generation in dense urban environments, yet assessing their photovoltaic (PV) potential remains challenging due to complex geometries and semantic components. This study introduces SF‑SPA (Semantic Facade Solar‑PV Assessment), an automated framework that transforms street‑view photographs into quantitative PV deployment assessments. The approach combines computer vision and artificial intelligence techniques to address three key challenges: perspective distortion correction, semantic understanding of facade elements, and spatial reasoning for PV layout optimization. Our four‑stage pipeline processes images through geometric rectification, zero‑shot semantic segmentation, Large Language Model (LLM) guided spatial reasoning, and energy simulation. Validation across 80 buildings in four countries demonstrates robust performance with mean area estimation errors of 6.2% ± 2.8% compared to expert annotations. The automated assessment requires approximately 100 seconds per building, a substantial gain in efficiency over manual methods. Simulated energy yield predictions confirm the method's reliability and applicability for regional potential studies, urban energy planning, and building‑integrated photovoltaic (BIPV) deployment. Code is available at: https://github.com/CodeAXu/Solar‑PV‑Installation

Abstract:
To fully leverage spatial information for remote sensing image segmentation and address semantic edge ambiguities caused by grayscale variations (e.g., shadows and low‑contrast regions), we propose the Frequency and Spatial Domains based Detail Enhancement Network (FSDENet). Our framework employs spatial processing methods to extract rich multi‑scale spatial features and fine‑grained semantic details. By effectively integrating global and frequency‑domain information through the Fast Fourier Transform (FFT) in global mappings, the model's capability to discern global representations under grayscale variations is significantly strengthened. Additionally, we utilize Haar wavelet transform to decompose features into high‑ and low‑frequency components, leveraging their distinct sensitivity to edge information to refine boundary segmentation. The model achieves dual‑domain synergy by integrating spatial granularity with frequency‑domain edge sensitivity, substantially improving segmentation accuracy in boundary regions and grayscale transition zones. Comprehensive experimental results demonstrate that FSDENet achieves state‑of‑the‑art (SOTA) performance on four widely adopted datasets: LoveDA, Vaihingen, Potsdam, and iSAID.
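
The Haar decomposition mentioned above can be illustrated with a minimal single-level 1D sketch. This is not the FSDENet implementation: the function names and the toy feature row are assumptions chosen only to show how the high-frequency band responds to an edge while the low-frequency band keeps the smooth content.

```python
import math


def haar_1d(signal):
    """Single-level orthonormal Haar transform of an even-length sequence."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = 1.0 / math.sqrt(2.0)
    # Low band: scaled pairwise sums (smooth content).
    low = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    # High band: scaled pairwise differences (edge content).
    high = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return low, high


def inverse_haar_1d(low, high):
    """Reconstruct the original sequence from the two bands."""
    s = 1.0 / math.sqrt(2.0)
    out = []
    for a, d in zip(low, high):
        out.append((a + d) * s)
        out.append((a - d) * s)
    return out


# Hypothetical feature row with one intensity step between positions 2 and 3.
row = [10.0, 10.0, 10.0, 80.0, 80.0, 80.0]
low, high = haar_1d(row)
restored = inverse_haar_1d(low, high)
```

Only the pair that straddles the step produces a non-zero high-frequency coefficient, which is the property a boundary-refinement branch can exploit.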

Abstract:
Existing multimodal audio generation models often lack precise user control, which limits their applicability in professional Foley workflows. In particular, these models focus on the entire video and do not provide precise methods for prioritizing a specific object within a scene, generating unnecessary background sounds, or focusing on the wrong objects. To address this gap, we introduce the novel task of video object segmentation‑aware audio generation, which explicitly conditions sound synthesis on object‑level segmentation maps. We present SAGANet, a new multimodal generative model that enables controllable audio generation by leveraging visual segmentation masks along with video and textual cues. Our model provides users with fine‑grained and visually localized control over audio generation. To support this task and further research on segmentation‑aware Foley, we propose Segmented Music Solos, a benchmark dataset of musical instrument performance videos with segmentation information. Our method demonstrates substantial improvements over current state‑of‑the‑art methods and sets a new standard for controllable, high‑fidelity Foley synthesis. Code, samples, and Segmented Music Solos are available at https://saganet.notion.site

Abstract:
Breast magnetic resonance imaging is a critical tool for cancer detection and treatment planning, but its clinical utility is hindered by poor specificity, leading to high false‑positive rates and unnecessary biopsies. This study introduces a transformer‑based framework for automated classification of breast lesions in dynamic contrast‑enhanced MRI, addressing the challenge of distinguishing benign from malignant findings. We implemented a SegFormer architecture that achieved an AUC of 0.92 for lesion‑level classification, with 100% sensitivity and 67% specificity at the patient level ‑ potentially eliminating one‑third of unnecessary biopsies without missing malignancies. The model quantifies malignant pixel distribution via semantic segmentation, producing interpretable spatial predictions that support clinical decision‑making. To establish reproducible benchmarks, we curated BreastDCEDL_AMBL by transforming The Cancer Imaging Archive's AMBL collection into a standardized deep learning dataset with 88 patients and 133 annotated lesions (89 benign, 44 malignant). This resource addresses a key infrastructure gap, as existing public datasets lack benign lesion annotations, limiting benign‑malignant classification research. Training incorporated an expanded cohort of over 1,200 patients through integration with BreastDCEDL datasets, validating transfer learning approaches despite primary tumor‑only annotations. Public release of the dataset, models, and evaluation protocols provides the first standardized benchmark for DCE‑MRI lesion classification, enabling methodological advancement toward clinical deployment.
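
For readers less familiar with the patient-level metrics quoted above, the following sketch shows how sensitivity and specificity are computed from binary labels. The labels are hypothetical toy values, not data from the study, and the function name is an assumption.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, 1 = malignant."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical cohort: 4 malignant patients, 6 benign patients.
truth = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
preds = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1]
sens, spec = sensitivity_specificity(truth, preds)
```

Sensitivity is the fraction of malignant cases flagged, and specificity is the fraction of benign cases correctly cleared.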

Abstract:
Parameter‑Efficient Fine‑Tuning (PEFT) of foundation models for agricultural computer vision tasks remains challenging due to limited training data and complex field conditions. This study introduces a Dynamic Similarity‑based Graph Adaptation (DSGA) module to adapt the Segment Anything Model (SAM) under extreme data constraints for precise foreground and instance segmentation of small dense objects in complex agricultural environments. Through dynamic similarity graph construction with a learnable polynomial decay‑initialized weight ranking mechanism and adaptive local feature aggregation, DSGA establishes robust spatial and dynamic similarity representation with only 4.00M trainable parameters, which is 4.26% of the original SAM. Integrating this graph‑based feature adaptation with Low‑Rank Adaptation (LoRA) creates a complementary optimization framework that effectively captures both local and global dependencies in image embeddings while preserving model stability and parameter efficiency. Experimental results on a challenging chickpea pod dataset demonstrated that DSGA with LoRA achieved superior performance across multiple metrics evaluated under 2, 4, 8 and 10 shots, with progressive performance gains as shot count increased. Quantitative metrics showed a 17.31% improvement in Structure‑measure and a 62.36% gain in adaptive F‑measure compared to the baseline SAM fine‑tuning. Comprehensive ablation studies and visualization analyses through Grad‑CAM and t‑SNE validated the framework's effectiveness in feature discrimination. The proposed adaptation demonstrated practical utility for automated agricultural monitoring applications, achieving accurate pod‑counting with an adjusted R‑squared of 0.8987 for images with 10 to 120 pods under challenging field conditions.

Abstract:
Video Object Segmentation (VOS) aims to track and segment specific objects across entire video sequences, yet it remains highly challenging under complex real‑world scenarios. The MOSEv1 and LVOS datasets, adopted in the MOSEv1 Challenge at LSVOS 2025, are specifically designed to enhance the robustness of VOS models in such scenarios, including long‑term object disappearances and reappearances, as well as the presence of small and inconspicuous objects. In this paper, we present our improved method, Confidence‑Guided Fusion Segmentation (CGFSeg), for the VOS task in the MOSEv1 Challenge. During training, the feature extractor of SAM2 is frozen, while the remaining components are fine‑tuned to preserve strong feature extraction ability and improve segmentation accuracy. In the inference stage, we introduce a pixel‑check strategy that progressively refines predictions by exploiting complementary strengths of multiple models, thereby yielding robust final masks. As a result, our method achieves a J&F score of 86.37% on the test set, ranking 1st in the MOSEv1 Challenge at LSVOS 2025. These results highlight the effectiveness of our approach in addressing the challenges of the VOS task in complex scenarios.

Abstract:
Vision Graph Neural Networks (ViGs) have demonstrated promising performance in image recognition tasks against Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). An essential part of the ViG framework is the node‑neighbor feature aggregation method. Although various graph convolution methods, such as Max‑Relative, EdgeConv, GIN, and GraphSAGE, have been explored, a versatile aggregation method that effectively captures complex node‑neighbor relationships without requiring architecture‑specific refinements is needed. To address this gap, we propose a cross‑attention‑based aggregation method in which the query projections come from the node, while the key projections come from its neighbors. Additionally, we introduce a novel architecture called AttentionViG that uses the proposed cross‑attention aggregation scheme to conduct non‑local message passing. We evaluated the image recognition performance of AttentionViG on the ImageNet‑1K benchmark, where it achieved SOTA performance. Additionally, we assessed its transferability to downstream tasks, including object detection and instance segmentation on MS COCO 2017, as well as semantic segmentation on ADE20K. Our results demonstrate that the proposed method not only achieves strong performance, but also maintains efficiency, delivering competitive accuracy with comparable FLOPs to prior vision GNN architectures.
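
The aggregation idea described above (queries from the node, keys from its neighbors) can be sketched as scaled dot-product attention over a neighbor set. This is an illustrative assumption rather than the AttentionViG code: the value projection, the scaling factor, and the identity weight matrices are choices made here for the example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def cross_attention_aggregate(node, neighbors, Wq, Wk, Wv):
    """Aggregate neighbor features with attention weights from the node query."""
    q = matvec(Wq, node)                        # query comes from the node
    keys = [matvec(Wk, n) for n in neighbors]   # keys come from the neighbors
    values = [matvec(Wv, n) for n in neighbors]  # assumed value projection
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return out, weights


identity = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical projection weights
node = [1.0, 0.0]
neighbors = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aggregated, attn = cross_attention_aggregate(node, neighbors, identity, identity, identity)
```

Neighbors whose keys align with the node query receive larger weights, so the aggregated feature is dominated by the most similar neighbors.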

Abstract:
Deep learning models are increasingly used for radiographic analysis, but their reliability is challenged by the stochastic noise inherent in clinical imaging. A systematic, cross‑task understanding of how different noise types impact these models is lacking. Here, we evaluate the robustness of state‑of‑the‑art convolutional neural networks (CNNs) to simulated quantum (Poisson) and electronic (Gaussian) noise in two key chest X‑ray tasks: semantic segmentation and pulmonary disease classification. Using a novel, scalable noise injection framework, we applied controlled, clinically‑motivated noise severities to common architectures (UNet, DeepLabV3, FPN; ResNet, DenseNet, EfficientNet) on public datasets (Landmark, ChestX‑ray14). Our results reveal a stark dichotomy in task robustness. Semantic segmentation models proved highly vulnerable, with lung segmentation performance collapsing under severe electronic noise (Dice Similarity Coefficient drop of 0.843), signifying a near‑total model failure. In contrast, classification tasks demonstrated greater overall resilience, but this robustness was not uniform. We discovered a differential vulnerability: certain tasks, such as distinguishing Pneumothorax from Atelectasis, failed catastrophically under quantum noise (AUROC drop of 0.355), while others were more susceptible to electronic noise. These findings demonstrate that while classification models possess a degree of inherent robustness, pixel‑level segmentation tasks are far more brittle. The task‑ and noise‑specific nature of model failure underscores the critical need for targeted validation and mitigation strategies before the safe clinical deployment of diagnostic AI.
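
A noise-robustness study of this kind needs two ingredients: a way to inject noise and a metric to compare masks. The sketch below shows one possible version of each, assuming pixel intensities in [0, 1]; the function names, the photon-count parameterization, and Knuth's Poisson sampler are choices made here and are not taken from the paper's framework.

```python
import math
import random


def dice(pred, target):
    """Dice Similarity Coefficient for two flat binary masks (0/1 ints)."""
    inter = sum(p & t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 if total == 0 else 2.0 * inter / total


def add_gaussian_noise(pixels, sigma, rng):
    """Additive (electronic-style) noise, clipped back to [0, 1]."""
    return [min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in pixels]


def sample_poisson(lam, rng):
    """Knuth's algorithm; adequate for the small rates used in this sketch."""
    limit = math.exp(-lam)
    k, prod = 0, 1.0
    while True:
        k += 1
        prod *= rng.random()
        if prod <= limit:
            return k - 1


def add_poisson_noise(pixels, photons, rng):
    """Signal-dependent (quantum-style) noise with a given photon budget."""
    return [min(1.0, sample_poisson(p * photons, rng) / photons) for p in pixels]


rng = random.Random(0)
clean = [0.1, 0.1, 0.9, 0.9]
noisy_gauss = add_gaussian_noise(clean, 0.2, rng)
noisy_poisson = add_poisson_noise(clean, 20, rng)
```

A reported "Dice drop" is then the difference between `dice` on predictions from clean inputs and `dice` on predictions from noisy inputs.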

Abstract:
This study presents a comprehensive analysis of Ultralytics YOLO26 (also called YOLOv26), highlighting its key architectural enhancements and performance benchmarking for real‑time object detection. YOLO26, released in September 2025, stands as the newest and most advanced member of the YOLO family, purpose‑built to deliver efficiency, accuracy, and deployment readiness on edge and low‑power devices. The paper sequentially details architectural innovations of YOLO26, including the removal of Distribution Focal Loss (DFL), adoption of end‑to‑end NMS‑free inference, integration of ProgLoss and Small‑Target‑Aware Label Assignment (STAL), and the introduction of the MuSGD optimizer for stable convergence. Beyond architecture, the study positions YOLO26 as a multi‑task framework, supporting object detection, instance segmentation, pose/keypoints estimation, oriented detection, and classification. We present performance benchmarks of YOLO26 on edge devices such as NVIDIA Jetson Nano and Orin, comparing its results with YOLOv8, YOLOv11, YOLOv12, YOLOv13, and transformer‑based detectors (RF‑DETR and RT‑DETR). This paper further explores real‑time deployment pathways, flexible export options (ONNX, TensorRT, CoreML, TFLite), and quantization for INT8/FP16. Practical use cases of YOLO26 across robotics, manufacturing, and IoT are highlighted to demonstrate cross‑industry adaptability. Finally, insights on deployment efficiency and broader implications are discussed, with future directions for YOLO26 and the YOLO lineage outlined.

Abstract:
This paper develops a mathematical argument and algorithms for building representations of data from event‑based cameras, which we call Fast Feature Field (F^3). We learn this representation by predicting future events from past events and show that it preserves scene structure and motion information. F^3 exploits the sparsity of event data and is robust to noise and variations in event rates. It can be computed efficiently using ideas from multi‑resolution hash encoding and deep sets, achieving 120 Hz at HD and 440 Hz at VGA resolutions. F^3 represents events within a contiguous spatiotemporal volume as a multi‑channel image, enabling a range of downstream tasks. We obtain state‑of‑the‑art performance on optical flow estimation, semantic segmentation, and monocular metric depth estimation, on data from three robotic platforms (a car, a quadruped robot and a flying platform), across different lighting conditions (daytime, nighttime), environments (indoors, outdoors, urban, as well as off‑road) and dynamic vision sensors (resolutions and event rates). Our implementations can perform these tasks at 25‑75 Hz at HD resolution.

Abstract:
Vision Graph Neural Networks (Vision GNNs, or ViGs) represent images as unstructured graphs, achieving state‑of‑the‑art performance in computer vision tasks such as image classification, object detection, and instance segmentation. Dynamic Image Graph Construction (DIGC) builds image graphs by connecting patches (nodes) based on feature similarity, and is dynamically repeated in each ViG layer following GNN‑based patch (node) feature updates. However, DIGC constitutes over 50% of end‑to‑end ViG inference latency, rising to 95% at high image resolutions, making it the dominant computational bottleneck. While hardware acceleration holds promise, prior works primarily optimize graph construction algorithmically, often compromising DIGC flexibility, accuracy, or generality. To address these limitations, we propose a streaming, deeply pipelined FPGA accelerator for DIGC, featuring on‑chip buffers that process input features in small, uniform blocks. Our design minimizes external memory traffic via localized computation and performs efficient parallel sorting with local merge sort and global k‑way merging directly on streaming input blocks via heap insertion. This modular architecture scales seamlessly across image resolutions, ViG layer types, and model sizes and variants, and supports DIGC across diverse ViG‑based vision backbones. The design achieves high clock frequencies post place‑and‑route, as the statically configured parallelism minimizes critical‑path delay, and delivers up to 16.6x and 6.8x speedups over optimized CPU and GPU DIGC baselines, respectively.
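
A software analogue of the block-streaming selection described above may help fix ideas. The sketch below is not the FPGA design: it assumes squared Euclidean distance as the similarity measure, uses Python's `sorted` as a stand-in for the local merge sort, and keeps the global top-k in a bounded heap; all names and the block size are hypothetical.

```python
import heapq


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_streaming(query, features, k, block_size):
    """Return indices of the k nearest features, consuming them block by block."""
    heap = []  # bounded max-heap via negated distance: (-dist, index)
    for start in range(0, len(features), block_size):
        block = features[start:start + block_size]
        # Local sort of the current block (stand-in for the local merge sort).
        local = sorted((sq_dist(query, f), start + j) for j, f in enumerate(block))
        for dist, idx in local[:k]:
            if len(heap) < k:
                heapq.heappush(heap, (-dist, idx))
            elif -heap[0][0] > dist:
                # Candidate beats the current worst neighbor: replace it.
                heapq.heapreplace(heap, (-dist, idx))
            else:
                break  # the rest of this sorted block cannot improve the heap
    return [idx for _, idx in sorted(heap, key=lambda t: (-t[0], t[1]))]


features = [[5.0], [1.0], [3.0], [0.5], [4.0], [2.0]]
neighbors = knn_streaming([0.0], features, k=3, block_size=2)
```

Only one block and a heap of size k are held at any time, which mirrors the motivation for small on-chip buffers.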

Abstract:
Inspired by the human visual system's mechanisms for contrast enhancement and color‑opponency, we explore biologically motivated input preprocessing for robust semantic segmentation. By applying Difference‑of‑Gaussians (DoG) filtering to RGB, grayscale, and opponent‑color channels, we enhance local contrast without modifying model architecture or training. Evaluations on Cityscapes, ACDC, and Dark Zurich show that such preprocessing maintains in‑distribution performance while improving robustness to adverse conditions like night, fog, and snow. As this processing is model‑agnostic and lightweight, it holds potential for integration into imaging pipelines, enabling imaging systems to deliver task‑ready, robust inputs for downstream vision models in safety‑critical environments.
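
Difference-of-Gaussians filtering subtracts a wide Gaussian blur from a narrow one, which suppresses flat regions and emphasizes local contrast. The following is a minimal single-channel sketch; the kernel radius of three standard deviations, the edge-replication padding, and the sigma values are assumptions made here, not settings from the paper.

```python
import math


def gaussian_kernel(sigma):
    """Normalized 1D Gaussian kernel truncated at about three sigma."""
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(img, kernel):
    """Convolve every row with the kernel, replicating edge pixels."""
    radius = len(kernel) // 2
    width = len(img[0])
    out = []
    for row in img:
        new_row = []
        for x in range(width):
            acc = 0.0
            for k, wt in enumerate(kernel):
                xx = min(width - 1, max(0, x + k - radius))
                acc += wt * row[xx]
            new_row.append(acc)
        out.append(new_row)
    return out


def transpose(img):
    return [list(col) for col in zip(*img)]


def gaussian_blur(img, sigma):
    """Separable 2D blur: rows first, then columns."""
    kernel = gaussian_kernel(sigma)
    return transpose(blur_rows(transpose(blur_rows(img, kernel)), kernel))


def difference_of_gaussians(img, sigma_small, sigma_large):
    narrow = gaussian_blur(img, sigma_small)
    wide = gaussian_blur(img, sigma_large)
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(narrow, wide)]


# Hypothetical 9x9 grayscale image with a single bright pixel in the center.
spot = [[0.0] * 9 for _ in range(9)]
spot[4][4] = 1.0
dog_spot = difference_of_gaussians(spot, 1.0, 2.0)
dog_flat = difference_of_gaussians([[0.5] * 9 for _ in range(9)], 1.0, 2.0)
```

For multi-channel inputs such as RGB or opponent-color channels, the same function would be applied to each channel independently.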

Abstract:
Accurate segmentation of floating debris on water is often compromised by surface glare and changing outdoor illumination. Polarimetric imaging offers a single‑sensor route to mitigate water‑surface glare that disrupts semantic segmentation of floating objects. We benchmark state‑of‑the‑art fusion networks on PoTATO, a public dataset of polarimetric images of plastic bottles in inland waterways, and compare their performance with single‑image baselines using traditional models. Our results indicate that polarimetric cues help recover low‑contrast objects and suppress reflection‑induced false positives, raising mean IoU and lowering contour error relative to RGB inputs. These sharper masks come at a cost: the additional channels enlarge the models, increasing the computational load and introducing the risk of new false positives. By providing a reproducible, diagnostic benchmark and publicly available code, we hope to help researchers decide whether polarized cameras are suitable for their applications and to accelerate related research.

Abstract:
Open‑vocabulary camouflaged object segmentation requires models to segment camouflaged objects of arbitrary categories unseen during training, placing extremely high demands on generalization capabilities. Through analysis of existing methods, it is observed that the classification component significantly affects overall segmentation performance. Accordingly, a classifier‑centric adaptive framework is proposed to enhance segmentation performance by improving the classification component via a lightweight text adapter with a novel layered asymmetric initialization. Through the classification enhancement, the proposed method achieves substantial improvements in segmentation metrics compared to the OVCoser baseline on the OVCamo benchmark: cIoU increases from 0.443 to 0.493, cSm from 0.579 to 0.658, and cMAE reduces from 0.336 to 0.239. These results demonstrate that targeted classification enhancement provides an effective approach for advancing camouflaged object segmentation performance.

Abstract:
A central goal in AI is to represent scenes as compositions of discrete objects, enabling fine‑grained, controllable image and video generation. Yet leading diffusion models treat images holistically and rely on text conditioning, creating a mismatch for object‑level editing. This thesis introduces a framework that adapts powerful pretrained diffusion models for object‑centric synthesis while retaining their generative capacity. We identify a core challenge: balancing global scene coherence with disentangled object control. Our method integrates lightweight, slot‑based conditioning into pretrained models, preserving their visual priors while providing object‑specific manipulation. For images, SlotAdapt augments diffusion models with a register token for background/style and slot‑conditioned modules for objects, reducing text‑conditioning bias and achieving state‑of‑the‑art results in object discovery, segmentation, compositional editing, and controllable image generation. We further extend the framework to video. Using Invariant Slot Attention (ISA) to separate object identity from pose and a Transformer‑based temporal aggregator, our approach maintains consistent object representations and dynamics across frames. This yields new benchmarks in unsupervised video object segmentation and reconstruction, and supports advanced editing tasks such as object removal, replacement, and insertion without explicit supervision. Overall, this work establishes a general and scalable approach to object‑centric generative modeling for images and videos. By bridging human object‑based perception and machine learning, it expands the design space for interactive, structured, and user‑driven generative tools in creative, scientific, and practical domains.

Abstract:
3D scene understanding is fundamental for embodied AI and robotics, supporting reliable perception for interaction and navigation. Recent approaches achieve zero‑shot, open‑vocabulary 3D semantic mapping by assigning embedding vectors to 2D class‑agnostic masks generated via vision‑language models (VLMs) and projecting these into 3D. However, these methods often produce fragmented masks and inaccurate semantic assignments due to the direct use of raw masks, limiting their effectiveness in complex environments. To address this, we leverage SemanticSAM with progressive granularity refinement to generate more accurate and numerous object‑level masks, mitigating the over‑segmentation commonly observed in mask generation models such as vanilla SAM, and improving downstream 3D semantic segmentation. To further enhance semantic context, we employ a context‑aware CLIP encoding strategy that integrates multiple contextual views of each mask using empirically determined weighting, providing much richer visual context. We evaluate our approach on multiple 3D scene understanding tasks, including 3D semantic segmentation and object retrieval from language queries, across several benchmark datasets. Experimental results demonstrate significant improvements over existing methods, highlighting the effectiveness of our approach.
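
The weighted combination of contextual views can be sketched as a weighted average of per-view embeddings followed by cosine-similarity retrieval. Everything concrete here is an assumption for illustration: the three-dimensional toy vectors stand in for CLIP embeddings, and the weights are placeholders, not the empirically determined values from the paper.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def fuse_views(view_embeddings, weights):
    """Weighted average of per-view embeddings, renormalized to unit length."""
    total = sum(weights)
    dim = len(view_embeddings[0])
    fused = [
        sum(w * e[d] for w, e in zip(weights, view_embeddings)) / total
        for d in range(dim)
    ]
    return normalize(fused)


def cosine(a, b):
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


# Hypothetical embeddings of one mask: tight crop, padded crop, full-frame context.
views = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.0, 1.0, 0.0]]
mask_embedding = fuse_views(views, [0.5, 0.3, 0.2])
query_embedding = [0.9, 0.4, 0.0]  # stand-in for an encoded text query
score = cosine(mask_embedding, query_embedding)
```

Object retrieval from a language query then amounts to ranking masks by this score.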

Abstract:
Multimodal semantic segmentation enhances model robustness by exploiting cross‑modal complementarities. However, existing methods often suffer from imbalanced modal dependencies, where overall performance degrades significantly once a dominant modality deteriorates in real‑world scenarios. Thus, modality balance has become a critical challenge for practical multimodal segmentation. To address this issue, we propose EQUISeg, a multimodal segmentation framework that balances modality contributions through equal encoding of modalities. Built upon a four‑stage Cross‑modal Transformer Block (CMTB), EQUISeg enables efficient multimodal fusion and hierarchical selection. Furthermore, we design a Self‑guided Module (SGM) that mitigates modality imbalance by introducing a mutual guidance mechanism, enabling each modality to adaptively adjust its contribution and enhance robustness under degraded conditions. Extensive experiments on multiple datasets demonstrate that EQUISeg achieves significant performance gains and effectively alleviates the adverse effects of modality imbalance in segmentation tasks.

Abstract:
As a general‑purpose vision‑language pretraining model, CLIP demonstrates strong generalization ability in image‑text alignment tasks and has been widely adopted in downstream applications such as image classification and image‑text retrieval. However, it struggles with fine‑grained tasks such as object detection and semantic segmentation. While many variants aim to improve CLIP on these tasks, its robustness to adversarial perturbations remains underexplored. Understanding how adversarial examples transfer across tasks is key to assessing CLIP's generalization limits and security risks. In this work, we conduct a systematic empirical analysis of the cross‑task transfer behavior of CLIP‑based models on image‑text retrieval, object detection, and semantic segmentation under adversarial perturbations. We find that adversarial examples generated from fine‑grained tasks (e.g., object detection and semantic segmentation) often exhibit stronger transfer potential than those from coarse‑grained tasks, enabling more effective attacks against the original CLIP model. Motivated by this observation, we propose a novel framework, Multi‑Task Adversarial CLIP (MT‑AdvCLIP), which introduces a task‑aware feature aggregation loss and generates perturbations with enhanced cross‑task generalization capability. This design strengthens the attack effectiveness of fine‑grained task models on the shared CLIP backbone. Experimental results on multiple public datasets show that MT‑AdvCLIP significantly improves the adversarial transfer success rate against various CLIP‑derived models, raising the average attack success rate across multiple tasks by over 39%, without increasing the perturbation budget. This study reveals the transfer mechanism of adversarial examples in multi‑task CLIP models, offering new insights into multi‑task robustness evaluation and adversarial example design.

Abstract:
Semi‑supervised Video Object Segmentation aims to segment a specified target throughout a video sequence, initialized by a first‑frame mask. Previous methods rely heavily on appearance‑based pattern matching and thus exhibit limited robustness against challenges such as drastic visual changes, occlusions, and scene shifts. This failure is often attributed to a lack of high‑level conceptual understanding of the target. The recently proposed Segment Concept (SeC) framework mitigated this limitation by using a Large Vision‑Language Model (LVLM) to establish a deep semantic understanding of the object for more persistent segmentation. In this work, we evaluate its zero‑shot performance on the challenging coMplex video Object SEgmentation v2 (MOSEv2) dataset. Without any fine‑tuning on the training set, SeC achieved 39.7 J&Ḟ on the test set and ranked 2nd in the Complex VOS track of the 7th Large‑scale Video Object Segmentation Challenge.

Abstract:
Procedural Content Generation (PCG) techniques enable automatic creation of diverse and complex environments. While PCG facilitates more efficient content creation, ensuring consistently high‑quality, industry‑standard content remains a significant challenge. In this research, we propose a method to identify and repair unstable levels generated by existing PCG models. We use Angry Birds as a case study, demonstrating our method on game levels produced by established PCG approaches. Our method leverages object segmentation and visual analysis of level images to detect structural gaps and perform targeted repairs. We evaluate multiple object segmentation models and select the most effective one as the basis for our repair pipeline. Experimental results show that our method improves the stability and playability of AI‑generated levels. Although our evaluation is specific to Angry Birds, our image‑based approach is designed to be applicable to a wide range of 2D games with similar level structures.

Abstract:
Multi‑task dense prediction, which aims to jointly solve tasks like semantic segmentation and depth estimation, is crucial for robotics applications but suffers from domain shift when deploying models in new environments. While unsupervised domain adaptation (UDA) addresses this challenge for single tasks, existing multi‑task UDA methods primarily rely on adversarial learning approaches that are less effective than recent self‑training techniques. In this paper, we introduce FAMDA, a simple yet effective UDA framework that addresses this limitation by leveraging Vision Foundation Models (VFMs) as powerful teachers within a self‑training paradigm. Our approach integrates Segmentation and Depth foundation models into a self‑training paradigm to generate high‑quality pseudo‑labels for the target domain, effectively distilling their robust generalization capabilities into a single, efficient student network. Extensive experiments show that FAMDA achieves state‑of‑the‑art (SOTA) performance on standard synthetic‑to‑real UDA multi‑task learning (MTL) benchmarks and a challenging new day‑to‑night adaptation task. Our framework enables the training of highly efficient models; a lightweight variant achieves SOTA accuracy while being more than 10X smaller than foundation models, highlighting FAMDA's suitability for creating domain‑adaptive and efficient models for resource‑constrained robotics applications.

Abstract:
In this paper, we propose a training scheme called OVSeg3R to learn open‑vocabulary 3D instance segmentation from well‑studied 2D perception models with the aid of 3D reconstruction. OVSeg3R directly adopts reconstructed scenes from 2D videos as input, avoiding costly manual adjustment while aligning input with real‑world applications. By exploiting the 2D to 3D correspondences provided by 3D reconstruction models, OVSeg3R projects each view's 2D instance mask predictions, obtained from an open‑vocabulary 2D model, onto 3D to generate annotations for the view's corresponding sub‑scene. To avoid incorrectly introduced false positives as supervision due to partial annotations from 2D to 3D, we propose a View‑wise Instance Partition algorithm, which partitions predictions to their respective views for supervision, stabilizing the training process. Furthermore, since 3D reconstruction models tend to over‑smooth geometric details, clustering reconstructed points into representative super‑points based solely on geometry, as commonly done in mainstream 3D segmentation methods, may overlook geometrically non‑salient objects. We therefore introduce 2D Instance Boundary‑aware Superpoint, which leverages 2D masks to constrain the superpoint clustering, preventing superpoints from violating instance boundaries. With these designs, OVSeg3R not only extends a state‑of‑the‑art closed‑vocabulary 3D instance segmentation model to open‑vocabulary, but also substantially narrows the performance gap between tail and head classes, ultimately leading to an overall improvement of +2.3 mAP on the ScanNet200 benchmark. Furthermore, under the standard open‑vocabulary setting, OVSeg3R surpasses previous methods by about +7.1 mAP on the novel classes, further validating its effectiveness.

Abstract:
Unsupervised online 3D instance segmentation is a fundamental yet challenging task, as it requires maintaining consistent object identities across LiDAR scans without relying on annotated training data. Existing methods, such as UNIT, have made progress in this direction but remain constrained by limited training diversity, rigid temporal sampling, and heavy dependence on noisy pseudo‑labels. We propose a new framework that enriches the training distribution through synthetic point cloud sequence generation, enabling greater diversity without relying on manual labels or simulation engines. To better capture temporal dynamics, our method incorporates a flexible sampling strategy that leverages both adjacent and non‑adjacent frames, allowing the model to learn from long‑range dependencies as well as short‑term variations. In addition, a dynamic‑weighting loss emphasizes confident and informative samples, guiding the network toward more robust representations. Through extensive experiments on SemanticKITTI, nuScenes, and PandaSet, our method consistently outperforms UNIT and other unsupervised baselines, achieving higher segmentation accuracy and more robust temporal associations. The code will be publicly available at github.com/Eaphan/SFT3D.

Abstract:
Audiovisual instance segmentation (AVIS) requires accurately localizing and tracking sounding objects throughout video sequences. Existing methods suffer from visual bias stemming from two fundamental issues: uniform additive fusion prevents queries from specializing to different sound sources, while visual‑only training objectives allow queries to converge to arbitrary salient objects. We propose Audio‑Centric Query Generation using cross‑attention, enabling each query to selectively attend to distinct sound sources and carry sound‑specific priors into visual decoding. Additionally, we introduce a Sound‑Aware Ordinal Counting (SAOC) loss that explicitly supervises the number of sounding objects through ordinal regression with monotonic consistency constraints, preventing visual‑only convergence during training. Experiments on the AVISeg benchmark demonstrate consistent improvements: +1.64 mAP, +0.6 HOTA, and +2.06 FSLA, validating that query specialization and explicit counting supervision are crucial for accurate audiovisual instance segmentation.
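
As a rough illustration of ordinal regression with a monotonic consistency term, the sketch below encodes a count as cumulative binary targets and penalizes probabilities that increase with the threshold. The function names, the weight `lam`, and the exact form of the penalty are assumptions made for this example, not the SAOC loss as defined by the authors.

```python
import numpy as np

def ordinal_targets(count, max_count):
    """Cumulative encoding: target k is 1 when count > k."""
    return (np.arange(max_count) < count).astype(np.float64)

def ordinal_count_loss(probs, count, lam=1.0, eps=1e-7):
    """Binary cross-entropy on cumulative targets plus a monotonicity penalty.

    probs[k] is the predicted probability that the count exceeds k.
    """
    probs = np.clip(np.asarray(probs, dtype=np.float64), eps, 1.0 - eps)
    targets = ordinal_targets(count, len(probs))
    bce = -np.mean(targets * np.log(probs) + (1 - targets) * np.log(1 - probs))
    # P(count > k+1) should never exceed P(count > k).
    monotonic_penalty = np.sum(np.maximum(probs[1:] - probs[:-1], 0.0))
    return float(bce + lam * monotonic_penalty)
```

With two sounding objects and four thresholds, the target vector is `[1, 1, 0, 0]`, and a prediction whose probabilities decrease across thresholds incurs no penalty from the second term.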

Abstract:
The Generative Zero‑Shot Learning (GZSL) approach has demonstrated significant potential in 3D point cloud semantic segmentation tasks. GZSL leverages generative models like GANs or VAEs to synthesize realistic features of unseen classes. This allows the model to label unseen classes during testing, despite being trained only on seen classes. In this context, we introduce the Generalized Zero‑Shot Learning based‑upon Mixture‑of‑Experts (GZSL‑MoE) model. This model incorporates Mixture‑of‑Experts (MoE) layers to generate fake features that closely resemble real features extracted using a pre‑trained KPConv (Kernel Point Convolution) model on seen classes. The main contribution of this paper is the integration of Mixture‑of‑Experts into the Generator and Discriminator components of the Generative Zero‑Shot Learning model for 3D point cloud semantic segmentation, applied to the COVERED dataset (CollabOratiVE Robot Environment Dataset) for Human‑Robot Collaboration (HRC) environments. By combining the Generative Zero‑Shot Learning model with Mixture‑of‑Experts, GZSL‑MoE for 3D point cloud semantic segmentation provides a promising solution for understanding complex 3D environments, especially when comprehensive training data for all object classes is unavailable. The performance evaluation of the GZSL‑MoE model highlights its ability to enhance performance on both seen and unseen classes. Keywords: Generalized Zero‑Shot Learning (GZSL), 3D Point Cloud, 3D Semantic Segmentation, Human‑Robot Collaboration, COVERED (CollabOratiVE Robot Environment Dataset), KPConv, Mixture‑of‑Experts

Abstract:
Semi‑supervised semantic segmentation (SSSS) is vital in computational pathology, where dense annotations are costly and limited. Existing methods often rely on pixel‑level consistency, which propagates noisy pseudo‑labels and produces fragmented or topologically invalid masks. We propose Topology Graph Consistency (TGC), a framework that integrates graph‑theoretic constraints by aligning Laplacian spectra, component counts, and adjacency statistics between prediction graphs and references. This enforces global topology and improves segmentation accuracy. Experiments on GlaS and CRAG demonstrate that TGC achieves state‑of‑the‑art performance under 5‑10% supervision and significantly narrows the gap to full supervision. Code is available at https://github.com/hieuphamha19/TGC.
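
The graph statistics named above can be made concrete with a small sketch. It assumes the prediction has already been turned into a symmetric adjacency matrix (how TGC builds that graph is not stated here), and it uses the standard fact that the number of zero eigenvalues of the graph Laplacian equals the number of connected components. The function names are illustrative only.

```python
import numpy as np

def laplacian_spectrum(adj):
    """Sorted eigenvalues of the combinatorial Laplacian L = D - A."""
    adj = np.asarray(adj, dtype=np.float64)
    laplacian = np.diag(adj.sum(axis=1)) - adj
    return np.linalg.eigvalsh(laplacian)

def component_count(adj, tol=1e-8):
    """Connected components = multiplicity of the (near-)zero eigenvalue."""
    return int(np.sum(laplacian_spectrum(adj) < tol))

def spectrum_distance(adj_pred, adj_ref):
    """Mean squared gap between two spectra of equal-sized graphs."""
    diff = laplacian_spectrum(adj_pred) - laplacian_spectrum(adj_ref)
    return float(np.mean(diff ** 2))
```

A consistency term in this spirit would be small when the predicted graph has the same component count and a similar spectrum as the reference, and larger when the prediction is fragmented.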

Abstract:
Most existing approaches to referring segmentation achieve strong performance only through fine‑tuning or by composing multiple pre‑trained models, often at the cost of additional training and architectural modifications. Meanwhile, large‑scale generative diffusion models encode rich semantic information, making them attractive as general‑purpose feature extractors. In this work, we introduce a new method that directly exploits features, namely attention scores, from diffusion transformers for downstream tasks, requiring neither architectural modifications nor additional training. To systematically evaluate these features, we extend benchmarks with vision‑language grounding tasks spanning both images and videos. Our key insight is that stop words act as attention magnets: they accumulate surplus attention and can be filtered to reduce noise. Moreover, we identify global attention sinks (GAS) emerging in deeper layers and show that they can be safely suppressed or redirected onto auxiliary tokens, leading to sharper and more accurate grounding maps. We further propose an attention redistribution strategy, where appended stop words partition background activations into smaller clusters, yielding sharper and more localized heatmaps. Building on these findings, we develop RefAM, a simple training‑free grounding framework that combines cross‑attention maps, GAS handling, and redistribution. Across zero‑shot referring image and video segmentation benchmarks, our approach achieves strong performance and surpasses prior methods on most datasets, establishing a new state of the art without fine‑tuning, additional components, or complex reasoning.
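
The stop-word filtering idea can be sketched as follows. This is a simplified illustration rather than RefAM itself: the stop-word list is a small hypothetical set, the input is assumed to be one cross-attention map per prompt token, and GAS handling and redistribution are left out.

```python
import numpy as np

# Hypothetical stop-word list, chosen only for this example.
STOP_WORDS = {"the", "a", "an", "of", "on", "in", "and", "to", "is"}

def grounding_map(tokens, attn):
    """Average per-token attention maps after dropping stop-word tokens.

    tokens: list of T prompt tokens.
    attn:   (T, H, W) array with one cross-attention map per token.
    Returns an (H, W) heatmap rescaled to [0, 1].
    """
    attn = np.asarray(attn, dtype=np.float64)
    keep = [i for i, tok in enumerate(tokens) if tok.lower() not in STOP_WORDS]
    if not keep:  # prompt made only of stop words: fall back to all tokens
        keep = list(range(len(tokens)))
    heat = attn[keep].mean(axis=0)
    span = heat.max() - heat.min()
    return (heat - heat.min()) / span if span > 0 else np.zeros_like(heat)
```

Thresholding the returned heatmap gives a coarse mask. In this sketch, the diffuse attention a stop word collects no longer dilutes the content words' maps.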

Abstract:
Visible and infrared image fusion (VIF) has gained significant attention in recent years due to its wide application in tasks such as scene segmentation and object detection. VIF methods can be broadly classified into traditional VIF methods and application‑oriented VIF methods. Traditional methods focus solely on improving the quality of fused images, while application‑oriented VIF methods additionally consider the performance of downstream tasks on fused images by introducing task‑specific loss terms during training. However, compared to traditional methods, application‑oriented VIF methods require datasets labeled for downstream tasks (e.g., semantic segmentation or object detection), making data acquisition labor‑intensive and time‑consuming. To address this issue, we propose a self‑supervised training framework for segmentation‑oriented VIF methods (SSVIF). Leveraging the consistency between feature‑level fusion‑based segmentation and pixel‑level fusion‑based segmentation, we introduce a novel self‑supervised task, cross‑segmentation consistency, that enables the fusion model to learn high‑level semantic features without the supervision of segmentation labels. Additionally, we design a two‑stage training strategy and a dynamic weight adjustment method for effective joint learning within our self‑supervised framework. Extensive experiments on public datasets demonstrate the effectiveness of our proposed SSVIF. Remarkably, although trained only on unlabeled visible‑infrared image pairs, our SSVIF outperforms traditional VIF methods and rivals supervised segmentation‑oriented ones. Our code will be released upon acceptance.

Abstract:
Semantic segmentation is a fundamental task in medical image analysis, aiding medical decision‑making by helping radiologists distinguish objects in an image. Research in this field has been driven by deep learning applications, which have the potential to scale these systems even in the presence of noise and artifacts. However, these systems are not yet perfected. We argue that performance can be improved by incorporating common medical knowledge into the segmentation model's loss function. To this end, we introduce Logic Tensor Networks (LTNs) to encode medical background knowledge using first‑order logic (FOL) rules. The encoded rules span from constraints on the shape of the produced segmentation, to relationships between different segmented areas. We apply LTNs in an end‑to‑end framework with a SwinUNETR for semantic segmentation. We evaluate our method on the task of segmenting the hippocampus in brain MRI scans. Our experiments show that LTNs improve the baseline segmentation performance, especially when training data is scarce. Despite being in its preliminary stages, we argue that neurosymbolic methods are general enough to be adapted and applied to other medical semantic segmentation tasks.

Abstract:
Lifting 2D open‑vocabulary understanding into 3D Gaussian Splatting (3DGS) scenes is a critical challenge. Mainstream methods, built on an embedding paradigm, suffer from three key flaws: (i) geometry‑semantic inconsistency, where points, rather than objects, serve as the semantic basis, limiting semantic fidelity; (ii) semantic bloat from injecting gigabytes of feature data into the geometry; and (iii) semantic rigidity, as one feature per Gaussian struggles to capture rich polysemy. To overcome these limitations, we introduce ExtrinSplat, a framework built on the extrinsic paradigm that decouples geometry from semantics. Instead of embedding features, ExtrinSplat clusters Gaussians into multi‑granularity, overlapping 3D object groups. A Vision‑Language Model (VLM) then interprets these groups to generate lightweight textual hypotheses, creating an extrinsic index layer that natively supports complex polysemy. By replacing costly feature embedding with lightweight indices, ExtrinSplat reduces scene adaptation time from hours to minutes and lowers storage overhead by several orders of magnitude. On benchmark tasks for open‑vocabulary 3D object selection and semantic segmentation, ExtrinSplat outperforms established embedding‑based frameworks, validating the efficacy and efficiency of the proposed extrinsic paradigm.

Abstract:
Visual In‑Context Learning (VICL) uses input‑output image pairs, referred to as in‑context pairs (or examples), as prompts alongside query images to guide models in performing diverse vision tasks. However, VICL often suffers from over‑reliance on a single in‑context pair, which can lead to biased and unstable predictions. We introduce PAtch‑based k‑Nearest neighbor visual In‑Context Learning (PANICL), a general training‑free framework that mitigates this issue by leveraging multiple in‑context pairs. PANICL smooths assignment scores across pairs, reducing bias without requiring additional training. Extensive experiments on a variety of tasks, including foreground segmentation, single object detection, colorization, multi‑object segmentation, and keypoint detection, demonstrate consistent improvements over strong baselines. Moreover, PANICL exhibits strong robustness to domain shifts, including dataset‑level shift (e.g., from COCO to Pascal) and label‑space shift (e.g., FSS‑1000), and generalizes well to other VICL models such as SegGPT, Painter, and LVM, highlighting its versatility and broad applicability.
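
To make the idea of smoothing scores across several in-context pairs concrete, here is a minimal sketch. It assumes each pair yields a score matrix over candidate assignments for every query patch and simply averages them before taking the argmax. The actual PANICL scoring and neighbor selection are not specified in the abstract, so the function and its input layout are illustrative.

```python
import numpy as np

def fused_assignment(scores):
    """Average assignment scores over k in-context pairs, then pick the best.

    scores: (k, num_patches, num_candidates) array, one score matrix per pair
            (assumed layout for this sketch).
    Returns a (num_patches,) array of candidate indices.
    """
    scores = np.asarray(scores, dtype=np.float64)
    smoothed = scores.mean(axis=0)  # one pair can no longer dominate
    return smoothed.argmax(axis=-1)
```

If one of three pairs favors a candidate that the other two reject, the averaged scores follow the majority, which is the kind of bias reduction the abstract describes.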

Abstract:
Semantic segmentation of remote sensing imagery is a fundamental task in computer vision, supporting a wide range of applications such as land use classification, urban planning, and environmental monitoring. However, this task is often challenged by the high spatial resolution, complex scene structures, and diverse object scales present in remote sensing data. To address these challenges, various deep learning architectures have been proposed, including convolutional neural networks, Vision Transformers, and the recently introduced Vision Mamba. Vision Mamba features a global receptive field and low computational complexity, demonstrating both efficiency and effectiveness in image segmentation. However, its reliance on global scanning tends to overlook critical local features, such as textures and edges, which are essential for achieving accurate segmentation in remote sensing contexts. To tackle this limitation, we propose SwinMamba, a novel framework inspired by the Swin Transformer. SwinMamba integrates localized Mamba‑style scanning within shifted windows with a global receptive field, to enhance the model's perception of both local and global features. Specifically, the first two stages of SwinMamba perform local scanning to capture fine‑grained details, while its subsequent two stages leverage global scanning to fuse broader contextual information. In our model, the use of overlapping shifted windows enhances inter‑region information exchange, facilitating more robust feature integration across the entire image. Extensive experiments on the LoveDA and ISPRS Potsdam datasets demonstrate that SwinMamba outperforms state‑of‑the‑art methods, underscoring its effectiveness and potential as a superior solution for semantic segmentation of remotely sensed imagery.
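
The local scanning stage can be pictured with the window partition used by Swin-style models. The sketch below only shows how a feature map is split into windows whose tokens form short scan sequences, with a cyclic shift borrowed from the Swin Transformer. It is an assumption about the general mechanism and does not reproduce SwinMamba's overlapping windows or its Mamba blocks.

```python
import numpy as np

def window_partition(x, window):
    """Split an (H, W, C) feature map into non-overlapping windows.

    Returns (num_windows, window * window, C): one token sequence per window,
    in row-major order, ready for a per-window sequence model.
    """
    h, w, c = x.shape
    assert h % window == 0 and w % window == 0, "pad the input first"
    x = x.reshape(h // window, window, w // window, window, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, window * window, c)

def shifted_window_partition(x, window):
    """Cyclically shift by half a window before partitioning (Swin-style)."""
    shift = window // 2
    return window_partition(np.roll(x, (-shift, -shift), axis=(0, 1)), window)
```

Alternating the plain and shifted partitions lets tokens near a window border share a sequence with their neighbors in the next layer.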

Abstract:
Architecture embodies aesthetic, cultural, and historical values, standing as a tangible testament to human civilization. Researchers have long leveraged virtual reality (VR), mixed reality (MR), and augmented reality (AR) to enable immersive exploration and interpretation of architecture, enhancing accessibility, public understanding, and creative workflows around architecture in education, heritage preservation, and professional design practice. However, existing VR/MR/AR systems are often developed case‑by‑case, relying on hard‑coded annotations and task‑specific interactions that do not scale across diverse built environments. In this work, we present ArchGPT, a multimodal architectural visual question answering (VQA) model, together with a scalable data‑construction pipeline for curating high‑quality, architecture‑specific VQA annotations. This pipeline yields Arch‑300K, a domain‑specialized dataset of approximately 315,000 image‑question‑answer triplets. Arch‑300K is built via a multi‑stage process: first, we curate architectural scenes from Wikimedia Commons and filter unconstrained tourist photo collections using a novel coarse‑to‑fine strategy that integrates 3D reconstruction and semantic segmentation to select occlusion‑free, structurally consistent architectural images. To mitigate noise and inconsistency in raw textual metadata, we propose an LLM‑guided text verification and knowledge‑distillation pipeline to generate reliable, architecture‑specific question‑answer pairs. Using these curated images and refined metadata, we further synthesize formal analysis annotations, including detailed descriptions and aspect‑guided conversations, to provide richer semantic variety while remaining faithful to the data. We perform supervised fine‑tuning of an open‑source multimodal backbone, ShareGPT4V‑7B, on Arch‑300K, yielding ArchGPT.

Abstract:
Semantic segmentation of LiDAR data presents considerable challenges, particularly when dealing with diverse sensor types and configurations. However, incorporating semantic information can significantly enhance the accuracy and robustness of LiDAR‑based localization techniques for autonomous mobile systems. We propose an approach that integrates semantic camera data with LiDAR segmentation to address this challenge. By projecting LiDAR points into the semantic segmentation space of the camera, our method enhances the precision and reliability of the LiDAR‑based localization pipeline. For validation, we utilize the CoCar NextGen platform from the FZI Research Center for Information Technology, which offers diverse sensor modalities and configurations. The sensor setup of CoCar NextGen enables a thorough analysis of different sensor types. Our evaluation leverages the state‑of‑the‑art Depth‑Anything network for camera image segmentation and an adaptive segmentation network for LiDAR segmentation. To establish a reliable ground truth for LiDAR‑based localization, we make use of a Global Navigation Satellite System (GNSS) solution with Real‑Time Kinematic corrections (RTK). Additionally, we conduct an extensive 55 km drive through the city of Karlsruhe, Germany, covering a variety of environments, including urban areas, multi‑lane roads, and rural highways. This multimodal approach paves the way for more reliable and precise autonomous navigation systems, particularly in complex real‑world environments.
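
Projecting LiDAR points into the camera's segmentation image follows the standard pinhole model. The sketch below assumes known intrinsics `K` and LiDAR-to-camera extrinsics `R`, `t`, ignores lens distortion, and uses nearest-pixel lookup. It illustrates the general technique and is not the pipeline used on CoCar NextGen.

```python
import numpy as np

def label_points_from_camera(points, K, R, t, sem_image):
    """Give each LiDAR point the semantic class of the pixel it projects to.

    points:    (N, 3) points in the LiDAR frame.
    K:         (3, 3) camera intrinsics.
    R, t:      rotation (3, 3) and translation (3,) from LiDAR to camera frame.
    sem_image: (H, W) integer class map from the camera segmentation network.
    Returns (N,) labels; -1 marks points behind the camera or outside the image.
    """
    points = np.asarray(points, dtype=np.float64)
    cam = points @ np.asarray(R).T + np.asarray(t)  # LiDAR -> camera frame
    labels = np.full(len(points), -1, dtype=np.int64)
    in_front = cam[:, 2] > 0
    uvw = cam[in_front] @ np.asarray(K).T           # homogeneous pixel coords
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    h, w = sem_image.shape
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    labels[np.flatnonzero(in_front)[inside]] = sem_image[v[inside], u[inside]]
    return labels
```

Points labeled -1 have no camera evidence and would have to rely on the LiDAR segmentation network alone.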

Abstract:
The majority of AI models in imaging and vision are customized to perform a specific high‑precision task. However, this strategy is inefficient for applications with a series of modular tasks, since each requires a mapping into a disparate latent domain. To address this inefficiency, we propose a universal Neural Space (NS), where an encoder‑decoder framework pre‑computes features across vision and imaging tasks. Our encoder learns transformation‑aware, generalizable representations, which enable multiple downstream AI modules to share the same feature space. This architecture reduces redundancy, improves generalization across domain shift, and establishes a foundation for efficient multi‑task vision pipelines. Furthermore, as opposed to larger transformer backbones, our backbone is lightweight and CNN‑based, allowing for wider deployment across hardware. We further demonstrate that imaging and vision modules, such as demosaicing, denoising, depth estimation, and semantic segmentation, can be performed efficiently in the NS.

Abstract:
Hyperspectral imaging (HSI) captures spatial information along with dense spectral measurements across numerous narrow wavelength bands. This rich spectral content has the potential to facilitate robust robotic perception, particularly in environments with complex material compositions, varying illumination, or other visually challenging conditions. However, current HSI semantic segmentation methods underperform due to their reliance on architectures and learning frameworks optimized for RGB inputs. In this work, we propose a novel hyperspectral adapter that leverages pretrained vision foundation models to effectively learn from hyperspectral data. Our architecture incorporates a spectral transformer and a spectrum‑aware spatial prior module to extract rich spatial‑spectral features. Additionally, we introduce a modality‑aware interaction block that facilitates effective integration of hyperspectral representations and frozen vision Transformer features through dedicated extraction and injection mechanisms. Extensive evaluations on three benchmark autonomous driving datasets demonstrate that our architecture achieves state‑of‑the‑art semantic segmentation performance while directly using HSI inputs, outperforming both vision‑based and hyperspectral segmentation methods. We make the code available at https://hsi‑adapter.cs.uni‑freiburg.de.

Abstract:
Visual impairments present significant challenges to individuals worldwide, impacting daily activities and quality of life. Visual neuroprosthetics offer a promising solution, leveraging advancements in technology to provide a simplified visual sense through devices comprising cameras, computers, and implanted electrodes. This study investigates user‑centered design principles for a phosphene vision algorithm, utilizing feedback from visually impaired individuals to guide the development of a gaze‑controlled semantic segmentation system. We conducted interviews revealing key design principles. These principles informed the implementation of a gaze‑guided semantic segmentation algorithm using the Segment Anything Model (SAM). In a simulated phosphene vision environment, participants performed object detection tasks under SAM, edge detection, and normal vision conditions. SAM improved identification accuracy over edge detection, remained effective in complex scenes, and was particularly robust for specific object shapes. These findings demonstrate the value of user feedback and the potential of gaze‑guided semantic segmentation to enhance neuroprosthetic vision.

Abstract:
Low‑latency intelligent systems are required for autonomous driving on non‑uniform terrain in open‑pit mines and developing countries. This work proposes a perception system for autonomous vehicles on unpaved roads and off‑road environments, capable of navigating rough terrain without a predefined trail. The Configurable Modular Segmentation Network (CMSNet) framework is proposed, facilitating different architectural arrangements. CMSNet configurations were trained to segment obstacles and trafficable ground on new images from unpaved/off‑road scenarios with adverse conditions (night, rain, dust). We investigated applying deep learning to detect drivable regions without explicit track boundaries, studied algorithm behavior under visibility impairment, and evaluated field tests with real‑time semantic segmentation. A new dataset, Kamino, is presented with almost 12,000 images from an operating vehicle with eight synchronized cameras. The Kamino dataset has a high number of labeled pixels compared to similar public collections and includes images from an off‑road proving ground emulating a mine under adverse visibility. To achieve real‑time inference, CMSNet CNN layers were methodically removed and fused using TensorRT, C++, and CUDA. Empirical experiments on two datasets validated the proposed system's effectiveness.

Abstract:
Accurate plant segmentation in thermal imagery remains a significant challenge for high‑throughput field phenotyping, particularly in outdoor environments where low contrast between plants and weeds and frequent occlusions hinder performance. To address this, we present a framework that leverages synthetic RGB imagery, a limited set of real annotations, and GAN‑based cross‑modality alignment to enhance semantic segmentation in thermal images. We trained models on 1,128 synthetic images containing complex mixtures of crop and weed plants in order to generate image segmentation masks for crop and weed plants. We additionally evaluated the benefit of integrating as few as five real, manually segmented field images within the training process using various sampling strategies. When combining all the synthetic images with a few labeled real images, we observed a maximum relative improvement of 22% for the weed class and 17% for the plant class compared to the full real‑data baseline. Cross‑modal alignment was enabled by translating RGB to thermal using CycleGAN‑turbo, allowing robust template matching without calibration. Results demonstrated that combining synthetic data with limited manual annotations and cross‑domain translation via generative models can significantly boost segmentation performance in complex field environments for multi‑modal imagery.

Abstract:
This technical report explores the MOSEv2 track of the LSVOS Challenge, which targets complex semi‑supervised video object segmentation. By analysing and adapting SeC, an enhanced SAM‑2 framework, we conduct a detailed study of its long‑term memory and concept‑aware memory, showing that long‑term memory preserves temporal continuity under occlusion and reappearance, while concept‑aware memory supplies semantic priors that suppress distractors; together, these traits directly benefit several of MOSEv2's core challenges. Our solution achieves a JF score of 39.89% on the test set, ranking 1st in the MOSEv2 track of the LSVOS Challenge.

Abstract:
Traffic safety remains a critical global concern, with timely and accurate accident detection essential for hazard reduction and rapid emergency response. Infrastructure‑based vision sensors offer scalable and efficient solutions for continuous real‑time monitoring, facilitating automated detection of accidents directly from captured images. This research investigates the zero‑shot capabilities of multimodal large language models (MLLMs) for detecting and describing traffic accidents using images from infrastructure cameras, thus minimizing reliance on extensive labeled datasets. Main contributions include: (1) Evaluation of MLLMs using the simulated DeepAccident dataset from CARLA, explicitly addressing the scarcity of diverse, realistic, infrastructure‑based accident data through controlled simulations; (2) Comparative performance analysis between Gemini 1.5 and 2.0, Gemma 3 and Pixtral models in accident identification and descriptive capabilities without prior fine‑tuning; and (3) Integration of advanced visual analytics, specifically YOLO for object detection, Deep SORT for multi‑object tracking, and Segment Anything (SAM) for instance segmentation, into enhanced prompts to improve model accuracy and explainability. Key numerical results show Pixtral as the top performer with an F1‑score of 0.71 and 83% recall, while Gemini models gained precision with enhanced prompts (e.g., Gemini 1.5 rose to 90%) but suffered notable F1 and recall losses. Gemma 3 offered the most balanced performance with minimal metric fluctuation. These findings demonstrate the substantial potential of integrating MLLMs with advanced visual analytics techniques, enhancing their applicability in real‑world automated traffic monitoring systems.

Abstract:
In this paper, we propose a weakly supervised semantic segmentation approach for food images which takes advantage of the zero‑shot capabilities and promptability of the Segment Anything Model (SAM) along with the attention mechanisms of Vision Transformers (ViTs). Specifically, we use class activation maps (CAMs) from ViTs to generate prompts for SAM, resulting in masks suitable for food image segmentation. The ViT model, a Swin Transformer, is trained exclusively using image‑level annotations, eliminating the need for pixel‑level annotations during training. Additionally, to enhance the quality of the SAM‑generated masks, we examine the use of image preprocessing techniques in combination with single‑mask and multi‑mask SAM generation strategies. The methodology is evaluated on the FoodSeg103 dataset, generating an average of 2.4 masks per image (excluding background), and achieving an mIoU of 0.54 for the multi‑mask scenario. We envision the proposed approach as a tool to accelerate food image annotation tasks or as an integrated component in food and nutrition tracking applications.
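
One simple way to turn a class activation map into prompts is to take its peak as a point prompt and the bounding box of its thresholded region as a box prompt. The sketch below assumes that strategy and a threshold of 0.5. The paper's actual prompt-generation rule is not given in the abstract, and no SAM API is called here.

```python
import numpy as np

def cam_to_prompts(cam, threshold=0.5):
    """Derive a point prompt and a box prompt from one class activation map.

    cam: (H, W) activation map for a single class.
    Returns ((x, y), (x0, y0, x1, y1)) in pixel coordinates, or None when the
    map is constant and therefore carries no localization signal.
    """
    cam = np.asarray(cam, dtype=np.float64)
    span = cam.max() - cam.min()
    if span == 0:
        return None
    norm = (cam - cam.min()) / span
    row, col = np.unravel_index(int(np.argmax(norm)), norm.shape)
    rows, cols = np.nonzero(norm >= threshold)
    box = (int(cols.min()), int(rows.min()), int(cols.max()), int(rows.max()))
    return (int(col), int(row)), box
```

The point and box are given in (x, y) order, the convention promptable segmenters such as SAM commonly use for coordinates.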

Abstract:
Domain adaptive segmentation (DAS) of numerous organelle instances from large‑scale electron microscopy (EM) is a promising way to enable annotation‑efficient learning. Inspired by SAM, we propose a promptable multitask framework, namely Prompt‑DAS, which is flexible enough to utilize any number of point prompts during the adaptation training stage and testing stage. Thus, with varying prompt configurations, Prompt‑DAS can perform unsupervised domain adaptation (UDA) and weakly supervised domain adaptation (WDA), as well as interactive segmentation during testing. Unlike the foundation model SAM, which necessitates a prompt for each individual object instance, Prompt‑DAS is only trained on a small dataset and can utilize full points on all instances, sparse points on partial instances, or even no points at all, facilitated by the incorporation of an auxiliary center‑point detection task. Moreover, a novel prompt‑guided contrastive learning is proposed to enhance discriminative feature learning. Comprehensive experiments conducted on challenging benchmarks demonstrate the effectiveness of the proposed approach over existing UDA, WDA, and SAM‑based approaches.

Abstract:
This paper presents an advanced tumor segmentation framework for real‑time MRI‑guided radiotherapy, designed for the TrackRAD2025 challenge. Our method leverages the XMem model, a memory‑augmented architecture, to segment tumors across long cine‑MRI sequences. The proposed system efficiently integrates memory mechanisms to track tumor motion in real‑time, achieving high segmentation accuracy even under challenging conditions with limited annotated data. Unfortunately, the detailed experimental records have been lost, preventing us from reporting precise quantitative results at this stage. Nevertheless, from our preliminary impressions during development, the XMem‑based framework demonstrated reasonable segmentation performance and satisfied the clinical real‑time requirement. Our work contributes to improving the precision of tumor tracking during MRI‑guided radiotherapy, which is crucial for enhancing the accuracy and safety of cancer treatments.

Abstract:
Research on unsupervised domain adaptation (UDA) for semantic segmentation of remote sensing images has been extensively conducted. However, research on how to achieve domain adaptation in practical scenarios where source domain data is inaccessible, namely source‑free domain adaptation (SFDA), remains limited. Self‑training has been widely used in SFDA, which requires obtaining as many high‑quality pseudo‑labels as possible to train models on target domain data. Most existing methods optimize the entire pseudo‑label set to obtain more supervisory information. However, as pseudo‑label sets often contain substantial noise, simultaneously optimizing all labels is challenging. This limitation undermines the effectiveness of optimization approaches and thus restricts the performance of self‑training. To address this, we propose a novel pseudo‑label optimization framework called Diffusion‑Guided Label Enrichment (DGLE), which starts from a few easily obtained high‑quality pseudo‑labels and propagates them to a complete set of pseudo‑labels while ensuring the quality of newly generated labels. Firstly, a pseudo‑label fusion method based on confidence filtering and super‑resolution enhancement is proposed, which utilizes cross‑validation of details and contextual information to obtain a small number of high‑quality pseudo‑labels as initial seeds. Then, we leverage the diffusion model to propagate incomplete seed pseudo‑labels with irregular distributions due to its strong denoising capability for randomly distributed noise and powerful modeling capacity for complex distributions, thereby generating complete and high‑quality pseudo‑labels. This method effectively avoids the difficulty of directly optimizing the complete set of pseudo‑labels, significantly improves the quality of pseudo‑labels, and thus enhances the model's performance in the target domain.
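
The confidence-filtering half of the seed selection can be sketched in a few lines. The threshold `tau` and the ignore value 255 are assumptions chosen for illustration, and the super-resolution branch and the fusion of the two branches described above are not modeled.

```python
import numpy as np

def seed_pseudo_labels(probs, tau=0.9, ignore_index=255):
    """Keep only confident predictions as seed pseudo-labels.

    probs: (C, H, W) softmax output of the source-trained model on a target image.
    Pixels whose top probability is below tau are set to ignore_index so that
    they contribute nothing until a later stage fills them in.
    """
    probs = np.asarray(probs, dtype=np.float64)
    labels = probs.argmax(axis=0)
    labels[probs.max(axis=0) < tau] = ignore_index
    return labels
```

The result is a sparse, irregularly distributed label map. The diffusion stage described above is what completes such a map into dense pseudo-labels.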

Abstract:
Synthetic data generation in histopathology faces unique challenges: preserving tissue heterogeneity, capturing subtle morphological features, and scaling to unannotated datasets. We present a latent diffusion model that generates realistic heterogeneous histopathology images through a novel dual‑conditioning approach combining semantic segmentation maps with tissue‑specific visual crops. Unlike existing methods that rely on text prompts or abstract visual embeddings, our approach preserves critical morphological details by directly incorporating raw tissue crops from corresponding semantic regions. For annotated datasets (i.e., Camelyon16, Panda), we extract patches ensuring 20‑80% tissue heterogeneity. For unannotated data (i.e., TCGA), we introduce a self‑supervised extension that clusters whole‑slide images into 100 tissue types using foundation model embeddings, automatically generating pseudo‑semantic maps for training. Our method synthesizes high‑fidelity images with precise region‑wise annotations, achieving superior performance on downstream segmentation tasks. When evaluated on annotated datasets, models trained on our synthetic data show competitive performance to those trained on real data, demonstrating the utility of controlled heterogeneous tissue generation. In quantitative evaluation, prompt‑guided synthesis reduces Fréchet Distance (FD) by up to 6x on Camelyon16 (from 430.1 to 72.0) and yields 2‑3x lower FD across Panda and TCGA. Downstream DeepLabv3+ models trained solely on synthetic data attain test IoU of 0.71 and 0.95 on Camelyon16 and Panda, within 1‑2% of real‑data baselines (0.72 and 0.96). By scaling to 11,765 TCGA whole‑slide images without manual annotations, our framework offers a practical solution to the urgent need for diverse, annotated histopathology data, addressing a critical bottleneck in computational pathology.

Abstract:
Self‑supervised learning (SSL) has emerged as a central paradigm for training foundation models by leveraging large‑scale unlabeled datasets, often producing representations with strong generalization capabilities. These models are typically pre‑trained on general‑purpose datasets such as ImageNet and subsequently adapted to various downstream tasks through finetuning. While prior work has investigated parameter‑efficient adaptation methods like adapters, LoRA, and prompt tuning, primarily targeting downstream finetuning, extending the SSL pre‑training itself in a continual manner to new domains under limited data remains largely underexplored, especially for downstream dense prediction tasks like semantic segmentation. In this work, we address the challenge of adapting vision foundation models to low‑data target domains through continual self‑supervised pre‑training, specifically targeting downstream semantic segmentation. We propose GLARE (Global Local and Regional Enforcement), a novel continual self‑supervised pre‑training task designed to enhance downstream semantic segmentation performance. GLARE introduces patch‑level augmentations to encourage local consistency and incorporates a regional consistency constraint that leverages spatial semantics in the data. For efficient continual pre‑training, we initialize Vision Transformers (ViTs) with weights from existing SSL models and update only lightweight adapter modules, specifically UniAdapter, while keeping the rest of the backbone frozen. Experiments across multiple semantic segmentation benchmarks on different domains demonstrate that GLARE consistently improves downstream performance with minimal computational and parameter overhead.

Abstract:
Large‑scale Video Object Segmentation (LSVOS) addresses the challenge of accurately tracking and segmenting objects in long video sequences, where difficulties stem from object reappearance, small‑scale targets, heavy occlusions, and crowded scenes. Existing approaches predominantly adopt SAM2‑based frameworks with various memory mechanisms for complex video mask generation. In this report, we propose Segment Anything with Memory Strengthened Object Navigation (SAMSON), the 3rd place solution in the MOSE track of ICCV 2025, which integrates the strengths of state‑of‑the‑art VOS models into an effective paradigm. To handle visually similar instances and long‑term object disappearance in MOSE, we incorporate a long‑term memory module for reliable object re‑identification. Additionally, we adopt SAM2Long as a post‑processing strategy to reduce error accumulation and enhance segmentation stability in long video sequences. Our method achieved a final performance of 0.8427 in terms of J&F on the test‑set leaderboard.

Abstract:
We present Flow‑Induced Diagonal Gaussian Processes (FiD‑GP), a compression framework that incorporates a compact inducing weight matrix to project a neural network's weight uncertainty into a lower‑dimensional subspace. Critically, FiD‑GP relies on normalising‑flow priors and spectral regularisations to augment its expressiveness and align the inducing subspace with feature‑gradient geometry through a numerically stable projection mechanism objective. Furthermore, we demonstrate how the prediction framework in FiD‑GP can help to design a single‑pass projection for Out‑of‑Distribution (OoD) detection. Our analysis shows that FiD‑GP improves uncertainty estimation ability on various tasks compared with SVGP‑based baselines, satisfies tight spectral residual bounds with theoretically guaranteed OoD detection, and significantly compresses the neural network's storage requirements at the cost of increased inference computation dependent on the number of inducing weights employed. Specifically, in a comprehensive empirical study spanning regression, image classification, semantic segmentation, and out‑of‑distribution detection benchmarks, it cuts Bayesian training cost by several orders of magnitude, compresses parameters by roughly 51%, reduces model size by about 75%, and matches state‑of‑the‑art accuracy and uncertainty estimation.

Abstract:
Source‑Free Domain Adaptation (SFDA) enables domain adaptation for semantic segmentation of Remote Sensing Images (RSIs) using only a well‑trained source model and unlabeled target domain data. However, the lack of ground‑truth labels in the target domain often leads to the generation of noisy pseudo‑labels. Such noise impedes the effective mitigation of domain shift (DS). To address this challenge, we propose ProSFDA, a prototype‑guided SFDA framework. It employs prototype‑weighted pseudo‑labels to facilitate reliable self‑training (ST) under pseudo‑label noise. In addition, we introduce a prototype‑contrast strategy that encourages the aggregation of features belonging to the same class, enabling the model to learn discriminative target domain representations without relying on ground‑truth supervision. Extensive experiments show that our approach substantially outperforms existing methods.

Abstract:
For dynamic human motion sequences, the original KeyNode‑Driven codec often struggles to retain compression efficiency when confronted with rapid movements or strong non‑rigid deformations. This paper proposes a novel Bi‑modal coding framework that enhances the flexibility of motion representation by integrating semantic segmentation and region‑specific transformation modeling. The rigid transformation model (rotation and translation) is extended with a hybrid scheme that selectively applies affine transformations (rotation, translation, scaling, and shearing) only to deformation‑rich regions (e.g., the torso, where loose clothing induces high variability), while retaining rigid models elsewhere. The affine model is decomposed into minimal parameter sets for efficient coding and combined through a component selection strategy guided by a Lagrangian Rate‑Distortion optimization. The results show that the Bi‑modal method achieves more accurate mesh deformation, especially in sequences involving complex non‑rigid motion, without compromising compression efficiency in simpler regions, with an average bit‑rate saving of 33.81% compared to the baseline.
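
The component selection described above is guided by a Lagrangian rate‑distortion cost. A minimal sketch of that idea, assuming the usual form J = D + lambda * R; the candidate names, distortion values, and bit counts below are hypothetical and are not taken from the paper:

```python
def select_mode(candidates, lam):
    """Return the candidate with the lowest Lagrangian cost J = D + lam * R.

    candidates: list of (name, distortion, rate_bits) tuples (hypothetical values).
    lam: Lagrange multiplier trading distortion against rate.
    """
    return min(candidates, key=lambda c: c[1] + lam * c[2])


# Hypothetical per-region options: a rigid model and two affine parameter subsets.
options = [
    ("rigid", 4.0, 12.0),        # rotation + translation only
    ("rigid+scale", 2.5, 18.0),  # adds scaling parameters
    ("full_affine", 1.0, 30.0),  # adds scaling and shearing
]

# A large multiplier penalizes bits, so the cheap rigid model wins.
low_rate_choice = select_mode(options, lam=0.5)
# A small multiplier favors low distortion, so the full affine model wins.
high_quality_choice = select_mode(options, lam=0.01)
```

With these made‑up numbers, a rate‑heavy setting keeps the rigid model and a distortion‑heavy setting picks the full affine model, which is the kind of per‑region trade‑off the abstract describes.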

Abstract:
In this paper, we present SegDINO3D, a novel Transformer encoder‑decoder framework for 3D instance segmentation. As 3D training data is generally not as abundant as 2D training images, SegDINO3D is designed to fully leverage 2D representations from a pre‑trained 2D detection model, including both image‑level and object‑level features, to improve 3D representation. SegDINO3D takes both a point cloud and its associated 2D images as input. In the encoder stage, it first enriches each 3D point by retrieving 2D image features from its corresponding image views and then leverages a 3D encoder for 3D context fusion. In the decoder stage, it formulates 3D object queries as 3D anchor boxes and performs cross‑attention from 3D queries to 2D object queries obtained from 2D images using the 2D detection model. These 2D object queries serve as a compact object‑level representation of 2D images, effectively avoiding the challenge of keeping thousands of image feature maps in memory while faithfully preserving the knowledge of the pre‑trained 2D model. The introduction of 3D box queries also enables the model to modulate cross‑attention using the predicted boxes for more precise querying. SegDINO3D achieves state‑of‑the‑art performance on the ScanNetV2 and ScanNet200 3D instance segmentation benchmarks. Notably, on the challenging ScanNet200 dataset, SegDINO3D significantly outperforms prior methods by +8.6 and +6.8 mAP on the validation and hidden test sets, respectively, demonstrating its superiority.

Abstract:
Point cloud segmentation is central to autonomous driving and 3D scene understanding. While voxel‑ and point‑based methods dominate recent research due to their compatibility with deep architectures and ability to capture fine‑grained geometry, they often incur high computational cost, irregular memory access, and limited real‑time efficiency. In contrast, range‑view methods, though relatively underexplored, can leverage mature 2D semantic segmentation techniques for fast and accurate predictions. Motivated by the rapid progress in Visual Foundation Models (VFMs) for captioning, zero‑shot recognition, and multimodal tasks, we investigate whether SAM2, the current state‑of‑the‑art VFM for segmentation tasks, can serve as a strong backbone for LiDAR point cloud segmentation in the range view. We present, to our knowledge, the first range‑view framework that adapts SAM2 to 3D segmentation, coupling efficient 2D feature extraction with standard projection/back‑projection to operate on point clouds. To optimize SAM2 for range‑view representations, we implement several architectural modifications to the encoder: (1) a novel module that emphasizes horizontal spatial dependencies inherent in LiDAR range images, (2) a customized configuration of tailored to the geometric properties of spherical projections, and (3) an adapted mechanism in the encoder backbone specifically designed to capture the unique spatial patterns and discontinuities present in range‑view pseudo‑images. Our approach achieves competitive performance on SemanticKITTI while benefiting from the speed, scalability, and deployment simplicity of 2D‑centric pipelines. This work highlights the viability of VFMs as general‑purpose backbones for 3D perception and opens a path toward unified, foundation‑model‑driven LiDAR segmentation. Our results suggest that range‑view segmentation methods using VFMs are a promising direction.
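
The "standard projection" into the range view is commonly a spherical projection of each LiDAR point onto an image grid. A minimal sketch of that mapping follows; the 64 x 2048 image size and the +3/-25 degree vertical field of view are assumed values typical of a 64‑beam sensor, and `project_point` is a hypothetical helper rather than code from the paper:

```python
import math


def project_point(x, y, z, width=2048, height=64,
                  fov_up_deg=3.0, fov_down_deg=-25.0):
    """Map one 3D point to (row, col, range) in a spherical range image.

    The image size and vertical field of view are assumptions for illustration.
    """
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        raise ValueError("the sensor origin has no defined direction")
    yaw = math.atan2(y, x)    # horizontal angle in [-pi, pi]
    pitch = math.asin(z / r)  # vertical angle
    fov_up = math.radians(fov_up_deg)
    fov_down = math.radians(fov_down_deg)
    fov = fov_up - fov_down
    u = 0.5 * (1.0 - yaw / math.pi) * width        # column coordinate
    v = (1.0 - (pitch - fov_down) / fov) * height  # row coordinate, top = fov_up
    col = min(width - 1, max(0, int(math.floor(u))))
    row = min(height - 1, max(0, int(math.floor(v))))
    return row, col, r


# A point 10 m straight ahead lands in the center column, near the top rows.
row, col, rng = project_point(10.0, 0.0, 0.0)
```

Back‑projection then amounts to reading the per‑pixel prediction at each point's (row, col) and assigning it to that point.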

Abstract:
Large‑scale land cover maps generated using deep learning play a critical role across a wide range of Earth science applications. Open in‑situ datasets from principled land cover surveys offer a scalable alternative to manual annotation for training such models. However, their sparse spatial coverage often leads to fragmented and noisy predictions when used with existing deep learning‑based land cover mapping approaches. A promising direction to address this issue is object‑based classification, which assigns labels to semantically coherent image regions rather than individual pixels, thereby imposing a minimum mapping unit. Despite this potential, object‑based methods remain underexplored in deep learning‑based land cover mapping pipelines, especially in the context of medium‑resolution imagery and sparse supervision. To address this gap, we propose LC‑SLab, the first deep learning framework for systematically exploring object‑based deep learning methods for large‑scale land cover classification under sparse supervision. LC‑SLab supports both input‑level aggregation via graph neural networks, and output‑level aggregation by postprocessing results from established semantic segmentation models. Additionally, we incorporate features from a large pre‑trained network to improve performance on small datasets. We evaluate the framework on annual Sentinel‑2 composites with sparse LUCAS labels, focusing on the tradeoff between accuracy and fragmentation, as well as sensitivity to dataset size. Our results show that object‑based methods can match or exceed the accuracy of common pixel‑wise models while producing substantially more coherent maps. Input‑level aggregation proves more robust on smaller datasets, whereas output‑level aggregation performs best with more data. Several configurations of LC‑SLab also outperform existing land cover products, highlighting the framework's practical utility.

Abstract:
Joint RGB‑infrared perception is essential for achieving robustness under diverse weather and illumination conditions. Although foundation models excel within single modalities, they suffer from substantial cross‑modal degradation, an issue we attribute to a pattern shortcut, i.e., a modal bias that prioritizes superficial sensor patterns over underlying semantics. To address this problem, we introduce UNIV, a Unified foundation model for Infrared and Visible modalities. At the core of UNIV lies Patch Cross‑modal Contrastive Learning (PCCL), a self‑supervised contrastive learning strategy that constructs a unified cross‑modal feature space. PCCL employs a frozen pre‑trained model to sample pseudo patch pairs based on semantic similarity, and aligns infrared‑visible representations by attracting semantically related pairs while repelling unrelated ones. This process simultaneously enhances cross‑modal alignment and inter‑class semantic separability, guiding the model to focus on semantic structure rather than falling into pattern shortcuts. To further enable cross‑modal learning, we introduce MVIP, the most comprehensive visible‑infrared benchmark to date, containing 98,992 precisely aligned image pairs across diverse scenes. Extensive experiments demonstrate UNIV's superior performance on infrared tasks (+1.7 mIoU for semantic segmentation and +0.7 mAP for detection), while maintaining competitive accuracy on RGB tasks.
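
The attract/repel behavior described above can be illustrated with a generic contrastive objective. The abstract does not state the exact loss used by PCCL, so the sketch below uses an InfoNCE‑style loss as a stand‑in; the feature vectors, the temperature of 0.1, and the function names are all assumptions:

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: low when the anchor is close to the positive
    and far from the negatives (a stand-in, not the paper's loss)."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Hypothetical 2-D patch features: an infrared anchor and visible-light patches.
infrared_patch = [1.0, 0.0]
related_visible = [0.9, 0.1]
unrelated_visible = [0.0, 1.0]
opposite_visible = [-1.0, 0.0]

# Correct pairing: the semantically related patch is the positive.
aligned_loss = info_nce(infrared_patch, related_visible,
                        [unrelated_visible, opposite_visible])
# Wrong pairing: an unrelated patch is treated as the positive.
misaligned_loss = info_nce(infrared_patch, unrelated_visible,
                           [related_visible, opposite_visible])
```

Minimizing such a loss pulls semantically related infrared and visible patches together and pushes unrelated ones apart, which is why the correctly paired case scores lower.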

Abstract:
Referential Video Object Segmentation (RVOS) aims to segment all objects in a video that match a given natural language description, bridging the gap between vision and language understanding. Recent work, such as Sa2VA, combines Large Language Models (LLMs) with SAM 2, leveraging the strong video reasoning capability of LLMs to guide video segmentation. In this work, we present a training‑free framework that substantially improves Sa2VA's performance on the RVOS task. Our method introduces two key components: (1) a Video‑Language Checker that explicitly verifies whether the subject and action described in the query actually appear in the video, thereby reducing false positives; and (2) a Key‑Frame Sampler that adaptively selects informative frames to better capture both early object appearances and long‑range temporal context. Without any additional training, our approach achieves a J&F score of 64.14% on the MeViS test set, ranking 2nd in the RVOS track of the 7th LSVOS Challenge at ICCV 2025.

Abstract:
Recent research on representation learning has proved the merits of multi‑modal clues for robust semantic segmentation. Nevertheless, a flexible pretrain‑and‑finetune pipeline for multiple visual modalities remains unexplored. In this paper, we propose a novel multi‑modal learning framework, termed OmniSegmentor. It has two key innovations: 1) Based on ImageNet, we assemble a large‑scale dataset for multi‑modal pretraining, called ImageNeXt, which contains five popular visual modalities. 2) We provide an efficient pretraining manner to endow the model with the capacity to encode different modality information in the ImageNeXt. For the first time, we introduce a universal multi‑modal pretraining framework that consistently amplifies the model's perceptual capabilities across various scenarios, regardless of the arbitrary combination of the involved modalities. Remarkably, our OmniSegmentor achieves new state‑of‑the‑art records on a wide range of multi‑modal semantic segmentation datasets, including NYU Depthv2, EventScape, MFNet, DeLiVER, SUNRGBD, and KITTI‑360.

Abstract:
Creating high‑fidelity 3D models of indoor environments is essential for applications in design, virtual reality, and robotics. However, manual 3D modeling remains time‑consuming and labor‑intensive. While recent advances in generative AI have enabled automated scene synthesis, existing methods often face challenges in balancing visual quality, diversity, semantic consistency, and user control. A major bottleneck is the lack of a large‑scale, high‑quality dataset tailored to this task. To address this gap, we introduce a comprehensive synthetic dataset, featuring 12,328 structured annotated scenes with 57,431 rooms, and 4.7M photorealistic 2D renderings. Leveraging this dataset, we present SpatialGen, a novel multi‑view multi‑modal diffusion model that generates realistic and semantically consistent 3D indoor scenes. Given a 3D layout and a reference image (derived from a text prompt), our model synthesizes appearance (color image), geometry (scene coordinate map), and semantic (semantic segmentation map) from arbitrary viewpoints, while preserving spatial consistency across modalities. SpatialGen consistently generates superior results to previous methods in our experiments. We are open‑sourcing our data and models to empower the community and advance the field of indoor scene understanding and generation.

Abstract:
Complex Video Object Segmentation (VOS) presents significant challenges in accurately segmenting objects across frames, especially in the presence of small and similar targets, frequent occlusions, rapid motion, and complex interactions. In this report, we present our solution for the LSVOS 2025 VOS Track based on the SAM2 framework. We adopt a pseudo‑labeling strategy during training: a trained SAM2 checkpoint is deployed within the SAM2Long framework to generate pseudo labels for the MOSE test set, which are then combined with existing data for further training. For inference, the SAM2Long framework is employed to obtain our primary segmentation results, while an open‑source SeC model runs in parallel to produce complementary predictions. A cascaded decision mechanism dynamically integrates outputs from both models, exploiting the temporal stability of SAM2Long and the concept‑level robustness of SeC. Benefiting from pseudo‑label training and cascaded multi‑model inference, our approach achieves a J&F score of 0.8616 on the MOSE test set (+1.4 points over our SAM2Long baseline), securing 2nd place in the LSVOS 2025 VOS Track and demonstrating strong robustness and accuracy in long, complex video segmentation scenarios.
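
For readers unfamiliar with the J&F score, J is the region similarity (intersection over union of predicted and ground‑truth masks) and J&F is the mean of J and the boundary measure F. A minimal sketch of J and of the final averaging, on made‑up toy masks; the boundary F‑measure itself is not implemented here, and the value 0.7 passed for it is hypothetical:

```python
def region_similarity(pred, gt):
    """Jaccard index J between two binary masks given as nested lists of 0/1."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j, f):
    """J&F is the arithmetic mean of region similarity J and boundary accuracy F."""
    return 0.5 * (j + f)


# Toy 2 x 3 masks: 2 pixels overlap, 4 pixels are covered by either mask.
pred_mask = [[1, 1, 0],
             [0, 1, 0]]
gt_mask = [[1, 0, 0],
           [0, 1, 1]]

j_score = region_similarity(pred_mask, gt_mask)
combined = j_and_f(j_score, 0.7)  # 0.7 is a hypothetical F value
```

Benchmark scores are typically obtained by averaging such per‑object, per‑frame values over a whole dataset.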

Abstract:
Vision Transformers (ViTs) achieve state‑of‑the‑art performance in semantic segmentation but are hindered by high computational and memory costs. To address this, we propose STEP (SuperToken and Early‑Pruning), a hybrid token‑reduction framework that combines dynamic patch merging and token pruning to enhance efficiency without significantly compromising accuracy. At the core of STEP is dCTS, a lightweight CNN‑based policy network that enables flexible merging into superpatches. Encoder blocks also integrate early exits to remove high‑confidence supertokens, lowering computational load. We evaluate our method on high‑resolution semantic segmentation benchmarks, including images up to 1024 x 1024, and show that when dCTS is applied alone, the token count can be reduced by a factor of 2.5 compared to the standard 16 x 16 pixel patching scheme. This yields a 2.6x reduction in computational cost and a 3.4x increase in throughput when using ViT‑Large as the backbone. Applying the full STEP framework further improves efficiency, reaching up to a 4x reduction in computational complexity and a 1.7x gain in inference speed, with a maximum accuracy drop of no more than 2.0%. With the proposed STEP configurations, up to 40% of tokens can be confidently predicted and halted before reaching the final encoder layer.
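
To put the token‑reduction factor in concrete terms, the arithmetic below works through the 1024 x 1024 case with 16 x 16 patches. It is only a worked example; `patch_tokens` is a hypothetical helper, and because the merging is dynamic the actual token count varies from image to image:

```python
def patch_tokens(height, width, patch=16):
    """Number of tokens produced by non-overlapping square patches."""
    return (height // patch) * (width // patch)


# Standard 16 x 16 patching of a 1024 x 1024 image: 64 * 64 tokens.
baseline_tokens = patch_tokens(1024, 1024)

# Average token count implied by the reported 2.5x reduction with dCTS alone.
merged_tokens = baseline_tokens / 2.5
```

Here the baseline is 4096 tokens, so a 2.5x reduction corresponds to roughly 1638 tokens on average.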

Abstract:
Self‑supervised learning through masked autoencoders has attracted great attention for remote sensing (RS) foundation model (FM) development, enabling improved representation learning across diverse sensors and downstream tasks. However, existing RS FMs often either suffer from substantial computational complexity during both training and inference or exhibit limited representational capacity. These issues restrict their practical applicability in RS. To address this limitation, we propose an adaptation for enhancing the efficiency of RS FMs by integrating the Soft mixture‑of‑experts (MoE) mechanism into the FM. The integration of Soft MoEs into the FM allows modality‑specific expert specialization alongside shared cross‑sensor representation learning. To demonstrate the effectiveness of our adaptation, we apply it on the Cross‑Sensor Masked Autoencoder (CSMAE) model, resulting in the Cross‑Sensor Mixture‑of‑Experts (CSMoE) model. In addition, we introduce a thematic‑climatic descriptor‑driven sampling strategy for the construction of a representative and diverse training set to train our CSMoE model. Extensive experiments on scene classification, semantic segmentation, and content‑based image retrieval demonstrate that our adaptation yields a reduction in computational requirements while maintaining or improving representational performance. Compared to state‑of‑the‑art RS FMs, CSMoE achieves a superior trade‑off between representational capacity, accuracy, and computational efficiency. On average, CSMoE achieves more than twice the computational efficiency of existing RS FMs, while maintaining competitive performance across all experiments. These results show the effectiveness of the proposed adaptation for creating computationally efficient RS FMs. The code for the model, the training set creation, and the model weights will be available at https://git.tu‑berlin.de/rsim/csmoe.

Abstract:
Few‑Shot 3D Point Cloud Semantic Segmentation (FS‑PCS) aims to predict per‑point labels for an unlabeled point cloud, given only a few labeled examples. To extract discriminative representations from the limited labeled set, existing methods have constructed prototypes using algorithms such as farthest point sampling (FPS). However, we point out that this convention has undesirable effects, as performance fluctuates depending on sampling, while the prototype generation process remains underexplored in the field. This motivates us to investigate an advanced prototype generation method based on an attention mechanism. Despite its potential, we found that a vanilla attention module suffers from the distributional gap between prototypical tokens and support features. To overcome this, we propose White Aggregation and Restoration Module (WARM), which resolves the misalignment by sandwiching cross‑attention between whitening and coloring transformations. Specifically, whitening aligns the features to tokens before the attention process, and coloring subsequently restores the original distribution to the attended tokens. This simple yet effective design enables robust attention, thereby generating prototypes that capture the semantic relationships in support features. WARM achieves state‑of‑the‑art performance by a significant margin on FS‑PCS benchmarks, and demonstrates its effectiveness through extensive experiments.

Abstract:
Unsupervised domain adaptation (UDA) for semantic segmentation seeks to transfer models from a labeled source domain to an unlabeled target domain. While auxiliary self‑supervised tasks such as contrastive learning have enhanced feature discriminability, masked modeling remains underexplored due to architectural constraints and misaligned objectives. We propose Masked Representation Modeling (MRM), an auxiliary task that performs representation masking and reconstruction directly in the latent space. Unlike prior masked modeling methods that reconstruct low‑level signals (e.g., pixels or visual tokens), MRM targets high‑level semantic features, aligning its objective with segmentation and integrating seamlessly into standard architectures like DeepLab and DAFormer. To support efficient reconstruction, we design a lightweight auxiliary module, Rebuilder, which is jointly trained with the segmentation network but removed during inference, introducing zero test‑time overhead. Extensive experiments demonstrate that MRM consistently improves segmentation performance across diverse architectures and UDA benchmarks. When integrated with four representative baselines, MRM achieves an average gain of +2.3 mIoU on GTA → Cityscapes and +2.8 mIoU on Cityscapes → Synthia, establishing it as a simple, effective, and generalizable strategy for unsupervised domain‑adaptive semantic segmentation.

Abstract:
The industrial insertion of flexible flat cables (FFCs) into receptacles presents a significant challenge owing to the need for submillimeter precision when handling the deformable cables. In manufacturing processes, FFC insertion with robotic manipulators often requires laborious human‑guided trajectory generation. While Reinforcement Learning (RL) offers a solution to automate this task without modeling complex properties of FFCs, the nondeterminism caused by the deformability of FFCs requires significant effort and time for training. Moreover, training directly in a real environment is dangerous, as industrial robots move fast and possess no safety measures. We propose an RL algorithm for FFC insertion that leverages a foundation model‑based real‑to‑sim approach to reduce the training time and eliminate the risk of physical damage to robots and surroundings. Training is done entirely in simulation, allowing for random exploration without the risk of physical damage. Sim‑to‑real transfer is achieved through semantic segmentation masks, which leave only those visual features relevant to the insertion task, such as the geometric and spatial information of the cables and receptacles. To enhance generality, we use a foundation model, Segment Anything Model 2 (SAM2). To eliminate human intervention, we employ a Vision‑Language Model (VLM) to automate the initial prompting of SAM2 to find segmentation masks. In the experiments, our method exhibits zero‑shot capabilities, which enable direct deployment to real environments without fine‑tuning.

Abstract:
Recently, query‑based methods have achieved remarkable performance in Referring Video Object Segmentation (RVOS) by using textual static object queries to drive cross‑modal alignment. However, these static queries are easily misled by distractors with similar appearance or motion, resulting in query selection bias. To address this issue, we propose Triple Query Former (TQF), which factorizes the referring query into three specialized components: an appearance query for static attributes, an intra‑frame interaction query for spatial relations, and an inter‑frame motion query for temporal association. Instead of relying solely on textual embeddings, our queries are dynamically constructed by integrating both linguistic cues and visual guidance. Furthermore, we introduce two motion‑aware aggregation modules that enhance object token representations: Intra‑frame Interaction Aggregation incorporates position‑aware interactions among objects within a single frame, while Inter‑frame Motion Aggregation leverages trajectory‑guided alignment across frames to ensure temporal coherence. Extensive experiments on multiple RVOS benchmarks demonstrate the advantages of TQF and the effectiveness of our structured query design and motion‑aware aggregation modules.

Abstract:
Purpose: Natural orifice surgeries minimize the need for incisions and reduce the recovery time compared to open surgery; however, they require a higher level of expertise due to visualization and orientation challenges. We propose a perception pipeline for these surgeries that allows semantic scene understanding. Methods: We bring learning‑based segmentation, depth estimation, and 3D reconstruction modules together to create real‑time segmented maps of the surgical scenes. Additionally, we use registration with robot poses to solve the scale ambiguity of mapping from monocular images, and allow the use of semantically informed real‑time reconstructions in robotic surgeries. Results: We achieve sub‑millimeter reconstruction accuracy based on average one‑sided Chamfer distances, an average pose registration RMSE of 0.9 mm, and an estimated scale within 2% of ground truth. Conclusion: We present a modular perception pipeline, integrating semantic segmentation with real‑time monocular SLAM for natural orifice surgeries. This pipeline offers a promising solution for scene understanding that can facilitate automation or surgeon guidance.
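
The reconstruction accuracy above is reported as an average one‑sided Chamfer distance. A minimal sketch of one common definition, the mean Euclidean distance from each reconstructed point to its nearest reference point; the abstract does not say whether plain or squared distances are used, so the plain distance here is an assumption, and the point sets are made up:

```python
import math


def one_sided_chamfer(source, target):
    """Mean distance from each source point to its nearest target point.

    Brute-force O(len(source) * len(target)); both lists must be non-empty.
    """
    total = 0.0
    for p in source:
        total += min(math.dist(p, q) for q in target)
    return total / len(source)


# Hypothetical reconstructed points and reference surface points (in mm).
reconstruction = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.5), (1.0, 0.0, 0.0), (5.0, 5.0, 5.0)]

# Reconstruction -> reference: ignores reference regions that were never reconstructed.
forward = one_sided_chamfer(reconstruction, reference)
# Reference -> reconstruction: penalized by the far-away, unreconstructed point.
backward = one_sided_chamfer(reference, reconstruction)
```

The asymmetry between the two directions is why the one‑sided variant suits partial reconstructions, where only part of the reference geometry has been observed.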

Abstract:
In autonomous driving, synthetic data is crucial for covering specific traffic scenarios that autonomous vehicles must handle. However, such data commonly introduces a domain gap between the synthetic and real domains. In this paper, we deploy data augmentation to generate custom traffic scenarios with VRUs in order to improve pedestrian recognition. We provide a pipeline for augmenting the Cityscapes dataset with virtual pedestrians. To improve the realism of the augmentation, we introduce a novel generative network architecture for adversarial learning of the dataset's lighting conditions. We also evaluate our approach on the tasks of semantic and instance segmentation.

Abstract:
X‑ray angiography is widely used in cardiac interventions to visualize coronary vessels, assess integrity, detect stenoses and guide treatment. We propose a framework for reconstructing 3D vessel trees from biplanar X‑ray images which are extracted from two X‑ray videos captured at different C‑arm angles. The proposed framework consists of three main components: image segmentation, motion phase matching, and 3D reconstruction. An automatic video segmentation method for X‑ray angiography is developed to enable semantic segmentation for image segmentation and motion phase matching. The goal of the motion phase matching is to identify a pair of X‑ray images that correspond to a similar respiratory and cardiac motion phase to reduce errors in 3D reconstruction. This is achieved by tracking a stationary object such as a catheter or lead within the X‑ray video. The semantic segmentation approach assigns different labels to different object classes, enabling accurate differentiation between blood vessels, balloons, and catheters. Once a suitable image pair is selected, key anatomical landmarks (vessel branching points and endpoints) are matched between the two views using a heuristic method that minimizes reconstruction errors. This is followed by a novel geometric reconstruction algorithm to generate the 3D vessel tree. The algorithm computes the 3D vessel centrelines by determining the intersection of two 3D surfaces. Compared to traditional methods based on epipolar constraints, the proposed approach simplifies the reconstruction workflow and improves overall accuracy. We trained and validated our segmentation method on 62 X‑ray angiography video sequences. On the test set, our method achieved a segmentation accuracy of 0.703. The 3D reconstruction framework was validated by measuring the reconstruction error of key anatomical landmarks, achieving a reprojection error of 0.62 mm +/‑ 0.38 mm.

Abstract:
With the growing deployment of autonomous driving agents, the detection and segmentation of road obstacles have become critical to ensure safe autonomous navigation. However, existing road‑obstacle segmentation methods are applied on individual frames, overlooking the temporal nature of the problem, leading to inconsistent prediction maps between consecutive frames. In this work, we demonstrate that the road‑obstacle segmentation task is inherently temporal, since the segmentation maps for consecutive frames are strongly correlated. To address this, we curate and adapt four evaluation benchmarks for road‑obstacle video segmentation and evaluate 11 state‑of‑the‑art image‑ and video‑based segmentation methods on these benchmarks. Moreover, we introduce two strong baseline methods based on vision foundation models. Our approach establishes a new state‑of‑the‑art in road‑obstacle video segmentation for long‑range video sequences, providing valuable insights and direction for future research.

Abstract:
Infrared and visible image fusion has garnered considerable attention owing to the strong complementarity of these two modalities in complex, harsh environments. While deep learning‑based fusion methods have made remarkable advances in feature extraction, alignment, fusion, and reconstruction, they still depend largely on low‑level visual cues, such as texture and contrast, and struggle to capture the high‑level semantic information embedded in images. Recent attempts to incorporate text as a source of semantic guidance have relied on unstructured descriptions that neither explicitly model entities, attributes, and relationships nor provide spatial localization, thereby limiting fine‑grained fusion performance. To overcome these challenges, we introduce MSGFusion, a multimodal scene graph‑guided fusion framework for infrared and visible imagery. By deeply coupling structured scene graphs derived from text and vision, MSGFusion explicitly represents entities, attributes, and spatial relations, and then synchronously refines high‑level semantics and low‑level details through successive modules for scene graph representation, hierarchical aggregation, and graph‑driven fusion. Extensive experiments on multiple public benchmarks show that MSGFusion significantly outperforms state‑of‑the‑art approaches, particularly in detail preservation and structural clarity, and delivers superior semantic consistency and generalizability in downstream tasks such as low‑light object detection, semantic segmentation, and medical image fusion.

Abstract:
Few‑shot 3D point cloud semantic segmentation aims to segment novel categories using a minimal number of annotated support samples. While existing prototype‑based methods have shown promise, they are constrained by two critical challenges: (1) Intra‑class Diversity, where a prototype's limited representational capacity fails to cover a class's full variations, and (2) Inter‑set Inconsistency, where prototypes derived from the support set are misaligned with the query feature space. Motivated by the powerful generative capability of diffusion model, we re‑purpose its pre‑trained conditional encoder to provide a novel source of generalizable features for expanding the prototype's representational range. Under this setup, we introduce the Prototype Expansion Network (PENet), a framework that constructs big‑capacity prototypes from two complementary feature sources. PENet employs a dual‑stream learner architecture: it retains a conventional fully supervised Intrinsic Learner (IL) to distill representative features, while introducing a novel Diffusion Learner (DL) to provide rich generalizable features. The resulting dual prototypes are then processed by a Prototype Assimilation Module (PAM), which adopts a novel push‑pull cross‑guidance attention block to iteratively align the prototypes with the query space. Furthermore, a Prototype Calibration Mechanism (PCM) regularizes the final big capacity prototype to prevent semantic drift. Extensive experiments on the S3DIS and ScanNet datasets demonstrate that PENet significantly outperforms state‑of‑the‑art methods across various few‑shot settings.

Abstract:
Medical image segmentation grapples with challenges including multi‑scale lesion variability, ill‑defined tissue boundaries, and computationally intensive processing demands. This paper proposes DyGLNet, which achieves efficient and accurate segmentation by fusing global and local features with a dynamic upsampling mechanism. The model introduces a hybrid feature extraction module (SHDCBlock) that combines single‑head self‑attention and multi‑scale dilated convolutions to jointly model local details and global context. We further introduce a dynamic adaptive upsampling module (DyFusionUp) that reconstructs feature maps with high fidelity using learnable offsets. A lightweight design is also adopted to reduce computational overhead. Experiments on seven public datasets demonstrate that DyGLNet outperforms existing methods, particularly in boundary accuracy and small‑object segmentation. It also exhibits lower computational complexity, offering an efficient and reliable solution for clinical medical image analysis. The code will be made available soon.

Abstract:
Accurate, high‑throughput phenotyping is a critical component of modern crop breeding programs, especially for improving traits such as mechanical stability, biomass production, and disease resistance. Stalk diameter is a key structural trait, but traditional measurement methods are labor‑intensive, error‑prone, and unsuitable for scalable phenotyping. In this paper, we present a geometry‑aware computer vision pipeline for estimating stalk diameter from RGB‑D imagery. Our method integrates deep learning‑based instance segmentation, 3D point cloud reconstruction, and axis‑aligned slicing via Principal Component Analysis (PCA) to perform robust diameter estimation. By mitigating the effects of curvature, occlusion, and image noise, this approach offers a scalable and reliable solution to support high‑throughput phenotyping in breeding and agronomic research.
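
The axis-aligned slicing step can be sketched as follows, under stated assumptions. The abstract does not give the exact procedure, so this sketch finds the stalk axis as the dominant eigenvector of the point covariance (computed by power iteration), bins points into slices along that axis, and takes twice the mean radial distance in each slice as the diameter. That last step assumes points cover the whole circumference. A single-view RGB‑D scan sees only one side of the stalk, so a real pipeline would likely need circle fitting or a similar correction. All names here are hypothetical.

```python
import math

def principal_axis(points, iters=100):
    """Return (centroid, unit dominant eigenvector) of a 3D point set."""
    n = len(points)
    mean = [sum(p[i] for p in points) / n for i in range(3)]
    centered = [[p[i] - mean[i] for i in range(3)] for p in points]
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(3)]
           for i in range(3)]
    v = [1.0, 1.0, 1.0]  # start vector; must not be orthogonal to the axis
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v

def slice_diameters(points, n_slices=5):
    """Estimate one diameter per slice along the principal axis."""
    mean, axis = principal_axis(points)
    projected = []
    for p in points:
        d = [p[i] - mean[i] for i in range(3)]
        t = sum(d[i] * axis[i] for i in range(3))        # position along axis
        radial = [d[i] - t * axis[i] for i in range(3)]  # offset from axis
        projected.append((t, math.sqrt(sum(x * x for x in radial))))
    t_min = min(t for t, _ in projected)
    t_max = max(t for t, _ in projected)
    width = (t_max - t_min) / n_slices or 1.0
    bins = [[] for _ in range(n_slices)]
    for t, r in projected:
        k = min(int((t - t_min) / width), n_slices - 1)
        bins[k].append(r)
    # None marks a slice with no points (for example, due to occlusion).
    return [2.0 * sum(rs) / len(rs) if rs else None for rs in bins]

# Synthetic stalk: a cylinder of radius 1 cm and length 30 cm along z.
radius, length = 0.01, 0.30
stalk = [
    (radius * math.cos(2 * math.pi * a / 12),
     radius * math.sin(2 * math.pi * a / 12),
     length * z / 30)
    for a in range(12) for z in range(31)
]
diameters = slice_diameters(stalk, n_slices=5)
```

Slicing along the principal axis rather than the image's vertical axis keeps the measured cross-sections perpendicular to the stalk even when it leans.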

Abstract:
Weakly Supervised Semantic Segmentation (WSSS) addresses the challenge of training segmentation models using only image‑level annotations. Existing WSSS methods struggle with precise object boundary localization and focus only on the most discriminative regions. To address these challenges, we propose IG‑CAM (Instance‑Guided Class Activation Mapping), a novel approach that leverages instance‑level cues and influence functions to generate high‑quality, boundary‑aware localization maps. Our method introduces three key innovations: (1) Instance‑Guided Refinement using object proposals to guide CAM generation, ensuring complete object coverage; (2) Influence Function Integration that captures the relationship between training samples and model predictions; and (3) Multi‑Scale Boundary Enhancement with progressive refinement strategies. IG‑CAM achieves state‑of‑the‑art performance on PASCAL VOC 2012 with 82.3% mIoU before post‑processing, improving to 86.6% after CRF refinement, significantly outperforming previous WSSS methods. Extensive ablation studies validate each component's contribution, establishing IG‑CAM as a new benchmark for weakly supervised semantic segmentation.

Abstract:
mmWave radars are compact, inexpensive, and durable sensors that are robust to occlusions and work regardless of environmental conditions, such as weather and darkness. However, this comes at the cost of poor angular resolution, especially for inexpensive single‑chip radars, which are typically used in automotive and indoor sensing applications. Although many have proposed learning‑based methods to mitigate this weakness, no standardized foundational models or large datasets for the mmWave radar have emerged, and practitioners have largely trained task‑specific models from scratch using relatively small datasets. In this paper, we collect (to our knowledge) the largest available raw radar dataset with 1M samples (29 hours) and train a foundational model for 4D single‑chip radar, which can predict 3D occupancy and semantic segmentation with quality that is typically only possible with much higher resolution sensors. We demonstrate that our Generalizable Radar Transformer (GRT) generalizes across diverse settings, can be fine‑tuned for different tasks, and shows logarithmic data scaling of 20% per 10× data. We also run extensive ablations on common design decisions, and find that using raw radar data significantly outperforms widely‑used lossy representations, equivalent to a 10× increase in training data. Finally, we roughly estimate that ≈100M samples (3000 hours) of data are required to fully exploit the potential of GRT.

Abstract:
Point cloud processing, a fundamental task in geomatics and computer vision, supports tasks and applications at different scales from air to ground, including mapping, environmental monitoring, urban/tree structure modeling, automated driving, robotics, and disaster response. Due to the rapid development of deep learning, point cloud processing algorithms are now almost exclusively dominated by learning‑based approaches, most of which have yet to be transitioned into real‑world practice. Existing surveys primarily focus on the ever‑updating network architectures that accommodate unordered point clouds, largely ignoring their practical value in typical point cloud processing applications, in which extra‑large data volumes, diverse scene contents, varying point density, and data modality need to be considered. In this paper, we provide a meta review of deep learning approaches and datasets that cover a selection of critical point cloud processing tasks in use, such as scene completion, registration, semantic segmentation, and modeling. By reviewing a broad range of urban and environmental applications these tasks can support, we identify gaps to be closed as these methods are transformed into applications and draw concluding remarks on both the algorithmic and practical aspects of the surveyed methods.

Abstract:
Object 6DoF (6D) pose estimation is essential for robotic perception, especially in industrial settings. It enables robots to interact with the environment and manipulate objects. However, existing benchmarks on object 6D pose estimation primarily use everyday objects with rich textures and low reflectivity, limiting model generalization to industrial scenarios where objects are often metallic, texture‑less, and highly reflective. To address this gap, we propose a novel dataset and benchmark, the Industrial Metallic Dataset (IMD), tailored for industrial applications. Our dataset comprises 45 true‑to‑scale industrial components, captured with an RGB‑D camera under natural indoor lighting and varied object arrangements to replicate real‑world conditions. The benchmark supports three tasks: video object segmentation, 6D pose tracking, and one‑shot 6D pose estimation. We evaluate existing state‑of‑the‑art models, including XMem and SAM2 for segmentation, and BundleTrack and BundleSDF for pose estimation, to assess model performance in industrial contexts. Evaluation results show that our industrial dataset is more challenging than existing household object datasets. This benchmark provides a baseline for developing and comparing segmentation and pose estimation algorithms that better generalize to industrial robotics scenarios.

Abstract:
Bright‑field microscopy, a cost‑effective solution for live‑cell culture, is often the only resource available, along with standard CPUs, for many low‑budget labs. The inherent challenges of bright‑field images ‑‑ their noisiness, low contrast, and dynamic morphology ‑‑ coupled with a lack of GPU resources and complex software interfaces, hinder the desired research output. This article presents a novel microscopy image analysis framework designed for low‑budget labs equipped with a standard CPU desktop. The Python‑based program enables cytometric analysis of live, unstained cells in culture through an advanced computer vision and machine learning pipeline. Crucially, the framework operates on label‑free data, requiring no manually annotated training data or training phase. It is accessible via a user‑friendly, cross‑platform GUI that requires no programming skills, while also providing a scripting interface for programmatic control and integration by developers. The end‑to‑end workflow performs semantic and instance segmentation, feature extraction, analysis, evaluation, and automated report generation. Its modular architecture supports easy maintenance and flexible integration while supporting both single‑image and batch processing. Validated on several unstained cell types from the public dataset of livecells, the framework demonstrates superior accuracy and reproducibility compared to contemporary tools like Cellpose and StarDist. Its competitive segmentation speed on a CPU‑based platform highlights its significant potential for basic research and clinical applications ‑‑ particularly in cell transplantation for personalised medicine and muscle regeneration therapies. The application is made available for reproducibility.

Abstract:
Multimodal learning has shown significant performance gains over ordinary unimodal models across various domains. However, in real‑world scenarios, multimodal signals are prone to go missing because of sensor failures and adverse weather conditions, which drastically degrades model operation and performance. Generative models such as AutoEncoder (AE) and Generative Adversarial Network (GAN) are intuitive solutions that aim to reconstruct missing modalities from available ones. Yet, their efficacy in remote sensing semantic segmentation remains underexplored. In this paper, we first examine the limitations of existing generative approaches in handling the heterogeneity of multimodal remote sensing data. They inadequately capture semantic context in complex scenes with large intra‑class and small inter‑class variation. In addition, traditional generative models tend to depend heavily on the dominant modality, introducing bias that affects model robustness under missing modality conditions. To tackle these limitations, we propose a novel Generative‑Enhanced MultiModal learning Network (GEMMNet) with three key components: (1) Hybrid Feature Extractor (HyFEx) to effectively learn modality‑specific representations, (2) Hybrid Fusion with Multiscale Awareness (HyFMA) to capture modality‑synergistic semantic context across scales, and (3) Complementary Loss (CoLoss) scheme to alleviate the inherent bias by encouraging consistency across modalities and tasks. Our method, GEMMNet, outperforms both the generative baselines, AE and cGAN (conditional GAN), and the state‑of‑the‑art non‑generative approaches, mmformer and shaspec, on two challenging remote sensing semantic segmentation datasets (Vaihingen and Potsdam). Source code is made available.

Abstract:
Timely assessment of structural damage is critical for disaster response and recovery. However, most prior work in natural disaster analysis relies on 2D imagery, which lacks depth, suffers from occlusions, and provides limited spatial context. 3D semantic segmentation offers a richer alternative, but existing 3D benchmarks focus mainly on urban or indoor scenes, with little attention to disaster‑affected areas. To address this gap, we present 3DAeroRelief, the first 3D benchmark dataset specifically designed for post‑disaster assessment. Collected using low‑cost unmanned aerial vehicles (UAVs) over hurricane‑damaged regions, the dataset features dense 3D point clouds reconstructed via Structure‑from‑Motion and Multi‑View Stereo techniques. Semantic annotations were produced through manual 2D labeling and projected into 3D space. Unlike existing datasets, 3DAeroRelief captures large‑scale 3D outdoor environments with fine‑grained structural damage in real‑world disaster contexts. UAVs enable affordable, flexible, and safe data collection in hazardous areas, making them particularly well‑suited for emergency scenarios. To demonstrate the utility of 3DAeroRelief, we evaluate several state‑of‑the‑art 3D segmentation models on the dataset to highlight both the challenges and opportunities of 3D scene understanding in disaster response. Our dataset serves as a valuable resource for advancing robust 3D vision systems in real‑world applications for post‑disaster scenarios.

Abstract:
Open‑vocabulary semantic segmentation enables models to recognize and segment objects from arbitrary natural language descriptions, offering the flexibility to handle novel, fine‑grained, or functionally defined categories beyond fixed label sets. While this capability is crucial for large‑scale urban point clouds that support applications such as digital twins, smart city management, and urban analytics, it remains largely unexplored in this domain. The main obstacles are the frequent absence of high‑quality, well‑aligned multi‑view imagery in large‑scale urban point cloud datasets and the poor generalization of existing three‑dimensional (3D) segmentation pipelines across diverse urban environments with substantial variation in geometry, scale, and appearance. To address these challenges, we present OpenUrban3D, the first 3D open‑vocabulary semantic segmentation framework for large‑scale urban scenes that operates without aligned multi‑view images, pre‑trained point cloud segmentation networks, or manual annotations. Our approach generates robust semantic features directly from raw point clouds through multi‑view, multi‑granularity rendering, mask‑level vision‑language feature extraction, and sample‑balanced fusion, followed by distillation into a 3D backbone model. This design enables zero‑shot segmentation for arbitrary text queries while capturing both semantic richness and geometric priors. Extensive experiments on large‑scale urban benchmarks, including SensatUrban and SUM, show that OpenUrban3D achieves significant improvements in both segmentation accuracy and cross‑scene generalization over existing methods, demonstrating its potential as a flexible and scalable solution for 3D urban scene understanding.

Abstract:
Isolated Sign Language Recognition (ISLR) approaches primarily rely on RGB data or signer pose information. However, combining these modalities often results in the loss of crucial details, such as hand shape and orientation, due to imprecise representations like bounding boxes. Therefore, we propose the ISLR system SegSLR, which combines RGB and pose information through promptable zero‑shot video segmentation. Given the rough localization of the hands and the signer's body from pose information, we segment the respective parts through the video to maintain all relevant shape information. Subsequently, the segmentations focus the processing of the RGB data on the most relevant body parts for ISLR. This effectively combines RGB and pose information. Our evaluation on the complex ChaLearn249 IsoGD dataset shows that SegSLR outperforms state‑of‑the‑art methods. Furthermore, ablation studies indicate that SegSLR strongly benefits from focusing on the signer's body and hands, justifying our design choices.

Abstract:
Vision Transformers (ViTs) have recently achieved strong results in semantic segmentation, yet their deployment on resource‑constrained devices remains limited due to their high memory footprint and computational cost. Quantization offers an effective strategy to improve efficiency, but ViT‑based segmentation models are notoriously fragile under low precision, as quantization errors accumulate across deep encoder‑decoder pipelines. We introduce I‑Segmenter, the first fully integer‑only ViT segmentation framework. Building on the Segmenter architecture, I‑Segmenter systematically replaces floating‑point operations with integer‑only counterparts. To further stabilize both training and inference, we propose λ‑ShiftGELU, a novel activation function that mitigates the limitations of uniform quantization in handling long‑tailed activation distributions. In addition, we remove the L2 normalization layer and replace bilinear interpolation in the decoder with nearest neighbor upsampling, ensuring integer‑only execution throughout the computational graph. Extensive experiments show that I‑Segmenter achieves accuracy within a reasonable margin of its FP32 baseline (5.1% on average), while reducing model size by up to 3.8x and enabling up to 1.2x faster inference with optimized runtimes. Notably, even in one‑shot PTQ with a single calibration image, I‑Segmenter delivers competitive accuracy, underscoring its practicality for real‑world deployment.
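
Two points in this abstract can be illustrated with a small sketch, which is not the I‑Segmenter implementation. The first function shows generic symmetric uniform quantization and why a long-tailed distribution is a problem: one large outlier sets the scale, so the small values all collapse to zero. The second shows that nearest neighbor upsampling needs only integer index arithmetic, whereas bilinear interpolation needs fractional weights. The function names and toy inputs are illustrative assumptions.

```python
def quantize_symmetric(values, num_bits=8):
    """Uniformly quantize floats to signed integers with a single scale."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale

def nearest_upsample(grid, factor):
    """Upsample a 2D grid by an integer factor using only integer indexing."""
    height, width = len(grid), len(grid[0])
    return [
        [grid[i // factor][j // factor] for j in range(width * factor)]
        for i in range(height * factor)
    ]

# Long-tailed toy activations: three small values and one large outlier.
quantized, scale = quantize_symmetric([0.01, 0.02, 0.03, 10.0])

# A 2x2 map of integer values upsampled to 4x4 without any float arithmetic.
upsampled = nearest_upsample([[1, 2], [3, 4]], factor=2)
```

In the toy example the outlier maps to 127 while the three small activations are all rounded to 0, which is the kind of resolution loss the abstract attributes to uniform quantization of long-tailed distributions.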

Abstract:
Multi‑human parsing is the task of segmenting human body parts while associating each part to the person it belongs to, combining instance‑level and part‑level information for fine‑grained human understanding. In this work, we demonstrate that, while state‑of‑the‑art approaches achieved notable results on public datasets, they struggle considerably in segmenting people with overlapping bodies. From the intuition that overlapping people may appear separated from a different point of view, we propose a novel training framework exploiting multi‑view information to improve multi‑human parsing models under occlusions. Our method integrates such knowledge during the training process, introducing a novel approach based on weak supervision on human instances and a multi‑view consistency loss. Given the lack of suitable datasets in the literature, we propose a semi‑automatic annotation strategy to generate human instance segmentation masks from multi‑view RGB+D data and 3D human skeletons. The experiments demonstrate that the approach can achieve up to a 4.20% relative improvement on human parsing over the baseline model in occlusion scenarios.

Abstract:
Tracking cells and detecting mitotic events in time‑lapse microscopy image sequences is a crucial task in biomedical research. However, it remains highly challenging due to dividing objects, low signal‑to‑noise ratios, indistinct boundaries, dense clusters, and the visually similar appearance of individual cells. Existing deep learning‑based methods rely on manually labeled datasets for training, which is both costly and time‑consuming. Moreover, their generalizability to unseen datasets remains limited due to the vast diversity of microscopy data. To overcome these limitations, we propose a zero‑shot cell tracking framework by integrating Segment Anything 2 (SAM2), a large foundation model designed for general image and video segmentation, into the tracking pipeline. As a fully‑unsupervised approach, our method does not depend on or inherit biases from any specific training dataset, allowing it to generalize across diverse microscopy datasets without finetuning. Our approach achieves competitive accuracy in both 2D and large‑scale 3D time‑lapse microscopy videos while eliminating the need for dataset‑specific adaptation.

Abstract:
We present Probabilistic Structure Integration (PSI), a system for learning richly controllable and flexibly promptable world models from data. PSI consists of a three‑step cycle. The first step, Probabilistic prediction, involves building a probabilistic graphical model Psi of the data, in the form of a random‑access autoregressive sequence model. Psi supports a complete set of learned conditional distributions describing the dependence of any variables in the data on any other set of variables. In step 2, Structure extraction, we show how to extract underlying low‑dimensional properties in the data, corresponding to a diverse set of meaningful "intermediate structures", in a zero‑shot fashion via causal inference on Psi. Step 3, Integration, completes the cycle by converting these structures into new token types that are then continually mixed back into the training diet as conditioning signals and prediction targets. Each such cycle augments the capabilities of Psi, both allowing it to model the underlying data better, and creating new control handles ‑‑ akin to an LLM‑like universal prompting language. We train an instance of Psi on 1.4 trillion tokens of internet video data; we use it to perform a variety of useful video prediction and understanding inferences; we extract state‑of‑the‑art optical flow, self‑supervised depth and object segmentation; and we use these structures to support a full cycle of predictive improvements.

Abstract:
Generalized zero‑shot semantic segmentation of 3D point clouds aims to classify each point into both seen and unseen classes. A significant challenge with these models is their tendency to make biased predictions, often favoring the classes encountered during training. This problem is more pronounced in 3D applications, where the scale of the training data is typically smaller than in image‑based tasks. To address this problem, we propose a novel method called E3DPC‑GZSL, which reduces overconfident predictions towards seen classes without relying on separate classifiers for seen and unseen data. E3DPC‑GZSL tackles the overconfidence problem by integrating an evidence‑based uncertainty estimator into a classifier. This estimator is then used to adjust prediction probabilities using a dynamic calibrated stacking factor that accounts for pointwise prediction uncertainty. In addition, E3DPC‑GZSL introduces a novel training strategy that improves uncertainty estimation by refining the semantic space. This is achieved by merging learnable parameters with text‑derived features, thereby improving model optimization for unseen data. Extensive experiments demonstrate that the proposed approach achieves state‑of‑the‑art performance on generalized zero‑shot semantic segmentation datasets, including ScanNet v2 and S3DIS.
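
The idea of calibrated stacking with an uncertainty-dependent factor can be sketched as below. This is a hypothetical illustration only: the abstract does not give the formula used by E3DPC‑GZSL, so the sketch simply assumes the penalty subtracted from seen-class scores is `gamma` times a pointwise uncertainty in [0, 1]. The function name, `gamma`, and the toy scores are all assumptions.

```python
def calibrated_stacking(scores, seen_classes, uncertainty, gamma=0.5):
    """Lower seen-class scores by an uncertainty-scaled penalty, then pick the best.

    scores: dict mapping class name to a prediction score for one point.
    uncertainty: assumed pointwise uncertainty in [0, 1] (hypothetical input).
    """
    penalty = gamma * uncertainty
    adjusted = {
        label: score - penalty if label in seen_classes else score
        for label, score in scores.items()
    }
    return max(adjusted, key=adjusted.get), adjusted

point_scores = {"chair": 0.55, "sofa": 0.45}  # "chair" seen in training, "sofa" unseen
seen = {"chair"}

# A confident point keeps its seen-class prediction.
confident_label, _ = calibrated_stacking(point_scores, seen, uncertainty=0.0)
# A highly uncertain point has its seen-class score reduced and flips to the unseen class.
uncertain_label, _ = calibrated_stacking(point_scores, seen, uncertainty=0.9)
```

With a fixed penalty every point would be shifted equally. Scaling by uncertainty, as assumed here, only shifts points where the classifier's evidence is weak.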

Abstract:
Large Language Models (LLMs) often produce monolithic text that is hard to edit in parts, which can slow down collaborative workflows. We present componentization, an approach that decomposes model outputs into modular, independently editable units while preserving context. We describe Modular and Adaptable Output Decomposition (MAOD), which segments responses into coherent components and maintains links among them, and we outline the Component‑Based Response Architecture (CBRA) as one way to implement this idea. Our reference prototype, MAODchat, uses a microservices design with state‑machine‑based decomposition agents, vendor‑agnostic model adapters, and real‑time component manipulation with recomposition. In an exploratory study with four participants from academic, engineering, and product roles, we observed that component‑level editing aligned with several common workflows and enabled iterative refinement and selective reuse. Participants also mentioned possible team workflows. Our contributions are: (1) a definition of componentization for transforming monolithic outputs into manipulable units, (2) CBRA and MAODchat as a prototype architecture, (3) preliminary observations from a small user study, (4) MAOD as an algorithmic sketch for semantic segmentation, and (5) example Agent‑to‑Agent protocols for automated decomposition. We view componentization as a promising direction for turning passive text consumption into more active, component‑level collaboration.

Abstract:
High‑quality data plays an indispensable role in the era of large models, but the use of unauthorized data for model training greatly damages the interests of data owners. To overcome this threat, several unlearnable methods have been proposed, which generate unlearnable examples (UEs) by compromising the training availability of data. Clearly, due to unknown training purposes and the powerful representation learning capabilities of existing models, these data are expected to be unlearnable for models across multiple tasks, i.e., they will not help improve the model's performance. However, unexpectedly, we find that on the multi‑task dataset Taskonomy, UEs still perform well in tasks such as semantic segmentation, failing to exhibit cross‑task unlearnability. This phenomenon leads us to question: How far are we from attaining truly unlearnable examples? We attempt to answer this question from the perspective of model optimization. To this end, we observe the difference in the convergence process between clean and poisoned models using a simple model architecture. Subsequently, from the loss landscape we find that only a part of the critical parameter optimization paths show significant differences, implying a close relationship between the loss landscape and unlearnability. Consequently, we employ the loss landscape to explain the underlying reasons for UEs and propose Sharpness‑Aware Learnability (SAL) to quantify the unlearnability of parameters based on this explanation. Furthermore, we propose an Unlearnable Distance (UD) to measure the unlearnability of data based on the SAL distribution of parameters in clean and poisoned models. Finally, we conduct benchmark tests on mainstream unlearnable methods using the proposed UD, aiming to promote community awareness of the capability boundaries of existing unlearnable methods.

Abstract:
Few‑shot semantic segmentation (FSS) aims to segment objects of novel categories in the query images given only a few annotated support samples. Existing methods primarily build the image‑level correlation between the support target object and the entire query image. However, this correlation contains hard pixel noise, i.e., irrelevant background objects, that is intractable to trace and suppress, leading to overfitting to the background. To address the limitation of this correlation, we imitate the biological vision process and identify novel objects from object‑level information. Identifying the target among general objects is more reliable than identifying it in the entire image, especially in the low‑data regime. Inspired by this, we design an Object‑level Correlation Network (OCNet) that establishes the object‑level correlation between the support target object and query general objects, and is mainly composed of the General Object Mining Module (GOMM) and Correlation Construction Module (CCM). Specifically, GOMM constructs the query general object feature by learning saliency and high‑level similarity cues, where the general objects include the irrelevant background objects and the target foreground object. Then, CCM establishes the object‑level correlation by allocating the target prototypes to match the general object feature. The generated object‑level correlation can mine the query target feature and suppress the hard pixel noise for the final prediction. Extensive experiments on PASCAL‑5^i and COCO‑20^i show that our model achieves state‑of‑the‑art performance.

Abstract:
3D object segmentation with Large Language Models (LLMs) has become a prevailing paradigm due to its broad semantics, task flexibility, and strong generalization. However, this paradigm is hindered by representation misalignment: LLMs process high‑level semantic tokens, whereas 3D point clouds convey only dense geometric structures. In prior methods, misalignment limits both input and output. At the input stage, dense point patches require heavy pre‑alignment, weakening object‑level semantics and confusing similar distractors. At the output stage, predictions depend only on dense features without explicit geometric cues, leading to a loss of fine‑grained accuracy. To address these limitations, we present the Point Linguist Model (PLM), a general framework that bridges the representation gap between LLMs and dense 3D point clouds without requiring large‑scale pre‑alignment between 3D‑text or 3D‑images. Specifically, we introduce Object‑centric Discriminative Representation (OcDR), which learns object‑centric tokens that capture target semantics and scene relations under a hard negative‑aware training objective. This mitigates the misalignment between LLM tokens and 3D points, enhances resilience to distractors, and facilitates semantic‑level reasoning within LLMs. For accurate segmentation, we introduce the Geometric Reactivation Decoder (GRD), which predicts masks by combining OcDR tokens carrying LLM‑inferred geometry with corresponding dense features, preserving comprehensive dense features throughout the pipeline. Extensive experiments show that PLM achieves significant improvements of +7.3 mIoU on ScanNetv2 and +6.0 mIoU on Multi3DRefer for 3D referring segmentation, with consistent gains across 7 benchmarks spanning 4 different tasks, demonstrating the effectiveness of comprehensive object‑centric reasoning for robust 3D understanding.

Abstract:
Robotic systems demand accurate and comprehensive 3D environment perception, requiring simultaneous capture of photo‑realistic appearance (optical), precise layout shape (geometric), and open‑vocabulary scene understanding (semantic). Existing methods typically achieve only partial fulfillment of these requirements while exhibiting optical blurring, geometric irregularities, and semantic ambiguities. To address these challenges, we propose OmniMap. Overall, OmniMap represents the first online mapping framework that simultaneously captures optical, geometric, and semantic scene attributes while maintaining real‑time performance and model compactness. At the architectural level, OmniMap employs a tightly coupled 3DGS‑Voxel hybrid representation that combines fine‑grained modeling with structural stability. At the implementation level, OmniMap identifies key challenges across different modalities and introduces several innovations: adaptive camera modeling for motion blur and exposure compensation, hybrid incremental representation with normal constraints, and probabilistic fusion for robust instance‑level understanding. Extensive experiments show OmniMap's superior performance in rendering fidelity, geometric accuracy, and zero‑shot semantic segmentation compared to state‑of‑the‑art methods across diverse scenes. The framework's versatility is further evidenced through a variety of downstream applications, including multi‑domain scene Q&A, interactive editing, perception‑guided manipulation, and map‑assisted navigation.

Abstract:
This article presents UrbanTwin datasets, high‑fidelity, realistic replicas of three public roadside lidar datasets: LUMPI, V2X‑Real‑IC, and TUMTraf‑I. Each UrbanTwin dataset contains 10K annotated frames corresponding to one of the public datasets. Annotations include 3D bounding boxes, instance segmentation labels, and tracking IDs for six object classes, along with semantic segmentation labels for nine classes. These datasets are synthesized using emulated lidar sensors within realistic digital twins, modeled based on surrounding geometry, road alignment at lane level, and the lane topology and vehicle movement patterns at intersections of the actual locations corresponding to each real dataset. Due to the precise digital twin modeling, the synthetic datasets are well aligned with their real counterparts, offering strong standalone and augmentative value for training deep learning models on tasks such as 3D object detection, tracking, and semantic and instance segmentation. We evaluate the alignment of the synthetic replicas through statistical and structural similarity analysis with real data, and further demonstrate their utility by training 3D object detection models solely on synthetic data and testing them on real, unseen data. The high similarity scores and improved detection performance, compared to models trained on real data, indicate that the UrbanTwin datasets effectively enhance existing benchmark datasets by increasing sample size and scene diversity. In addition, the digital twins can be adapted to test custom scenarios by modifying the design and dynamics of the simulations. To our knowledge, these are the first digitally synthesized datasets that can replace in‑domain real‑world datasets for lidar perception tasks. UrbanTwin datasets are publicly available at https://dataverse.harvard.edu/dataverse/ucf‑ut.

Abstract:
Bioprinting is a rapidly advancing field that offers a transformative approach to fabricating tissue and organ models through the precise deposition of cell‑laden bioinks. Ensuring the fidelity and consistency of printed structures in real‑time remains a core challenge, particularly under constraints imposed by limited imaging data and resource‑constrained embedded hardware. Semantic segmentation of the extrusion process, differentiating between nozzle, extruded bioink, and surrounding background, enables in situ monitoring critical to maintaining print quality and biological viability. In this work, we introduce a lightweight semantic segmentation framework tailored for real‑time bioprinting applications. We present a novel, manually annotated dataset comprising 787 RGB images captured during the bioprinting process, labeled across three classes: nozzle, bioink, and background. To achieve fast and efficient inference suitable for integration with bioprinting systems, we propose a BioLite U‑Net architecture that leverages depthwise separable convolutions to drastically reduce computational load without compromising accuracy. Our model is benchmarked against MobileNetV2 and MobileNetV3‑based segmentation baselines using mean Intersection over Union (mIoU), Dice score, and pixel accuracy. All models were evaluated on a Raspberry Pi 4B to assess real‑world feasibility. The proposed BioLite U‑Net achieves an mIoU of 92.85% and a Dice score of 96.17%, while being over 1300x smaller than MobileNetV2‑DeepLabV3+. On‑device inference takes 335 ms per frame, demonstrating near real‑time capability. Compared to MobileNet baselines, BioLite U‑Net offers a superior tradeoff between segmentation accuracy, efficiency, and deployability, making it highly suitable for intelligent, closed‑loop bioprinting systems.
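
As a rough illustration of why depthwise separable convolutions reduce computational load, the sketch below compares weight counts for a standard convolution and its depthwise separable counterpart. The layer sizes are hypothetical and are not taken from BioLite U-Net; biases are ignored.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a k x k standard convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in  # one k x k filter per input channel
    pointwise = c_in * c_out  # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes, chosen only for illustration.
c_in, c_out, k = 64, 128, 3
std = standard_conv_params(c_in, c_out, k)
sep = depthwise_separable_params(c_in, c_out, k)
print(f"standard={std}, separable={sep}, ratio={std / sep:.2f}")
```

For a 3 x 3 kernel the saving approaches a factor of nine as the number of output channels grows, which is the per-layer effect; the overall model-size reduction reported in the abstract also depends on the rest of the architecture.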

Abstract:
The growing popularity of robotic minimally invasive surgeries has made deep learning‑based surgical training a key area of research. A thorough understanding of the surgical scene components is crucial, which semantic segmentation models can help achieve. However, most existing work focuses on surgical tools and overlooks anatomical objects. Additionally, current state‑of‑the‑art (SOTA) models struggle to balance capturing high‑level contextual features and low‑level edge features. We propose a Feature‑Adaptive Spatial Localization model (FASL‑Seg), designed to capture features at multiple levels of detail through two distinct processing streams, namely a Low‑Level Feature Projection (LLFP) and a High‑Level Feature Projection (HLFP) stream, for varying feature resolutions ‑ enabling precise segmentation of anatomy and surgical instruments. We evaluated FASL‑Seg on surgical segmentation benchmark datasets EndoVis18 and EndoVis17 on three use cases. The FASL‑Seg model achieves a mean Intersection over Union (mIoU) of 72.71% on parts and anatomy segmentation in EndoVis18, improving on SOTA by 5%. It further achieves a mIoU of 85.61% and 72.78% in EndoVis18 and EndoVis17 tool type segmentation, respectively, outperforming SOTA overall performance, with comparable per‑class SOTA results in both datasets and consistent performance in various classes for anatomy and instruments, demonstrating the effectiveness of distinct processing streams for varying feature resolutions.
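
For readers unfamiliar with the metric, the following is a minimal sketch of mean Intersection over Union on flat label lists. It assumes classes absent from both prediction and ground truth are skipped; benchmark protocols such as those used for EndoVis may average differently.

```python
def mean_iou(pred, target, num_classes):
    """Mean of per-class IoU over classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # skip classes that appear in neither pred nor target
            ious.append(inter / union)
    return sum(ious) / len(ious)


# Toy labels for four pixels and two classes (illustrative only).
pred = [0, 0, 1, 1]
target = [0, 1, 1, 1]
miou = mean_iou(pred, target, num_classes=2)
print(f"mIoU = {miou:.4f}")
```

Here class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.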

Abstract:
Referring Video Object Segmentation (RVOS) aims to segment an object of interest throughout a video based on a language description. The prominent challenge lies in aligning static text with dynamic visual content, particularly when objects exhibit similar appearances with inconsistent motion and poses. However, current methods often rely on a holistic visual‑language fusion that struggles with complex, compositional descriptions. In this paper, we propose PARSE‑VOS, a novel, training‑free framework powered by Large Language Models (LLMs) for hierarchical, coarse‑to‑fine reasoning across text and video domains. Our approach begins by parsing the natural language query into structured semantic commands. Next, we introduce a spatio‑temporal grounding module that generates all candidate trajectories for all potential target objects, guided by the parsed semantics. Finally, a hierarchical identification module selects the correct target through a two‑stage reasoning process: it first performs coarse‑grained motion reasoning with an LLM to narrow down candidates; if ambiguity remains, a fine‑grained pose verification stage is conditionally triggered to disambiguate. The final output is an accurate segmentation mask for the target object. PARSE‑VOS achieved state‑of‑the‑art performance on three major benchmarks: Ref‑YouTube‑VOS, Ref‑DAVIS17, and MeViS.

Abstract:
Instance segmentation is essential for numerous computer vision applications, including robotics, human‑computer interaction, and autonomous driving. Currently, popular models bring impressive performance in instance segmentation by training with a large number of human annotations, which are costly to collect. For this reason, we present a new framework that efficiently and effectively segments objects without the need for human annotations. Firstly, a MultiCut algorithm is applied to self‑supervised features for coarse mask segmentation. Then, a mask filter is employed to obtain high‑quality coarse masks. To train the segmentation network, we compute a novel superpixel‑guided mask loss, comprising hard loss and soft loss, with high‑quality coarse masks and superpixels segmented from low‑level image features. Lastly, a self‑training process with a new adaptive loss is proposed to improve the quality of predicted masks. We conduct experiments on public datasets in instance segmentation and object detection to demonstrate the effectiveness of the proposed framework. The results show that the proposed framework outperforms previous state‑of‑the‑art methods.

Abstract:
Recent advances in semantic segmentation of multi‑modal remote sensing images have significantly improved the accuracy of tree cover mapping, supporting applications in urban planning, forest monitoring, and ecological assessment. Integrating data from multiple modalities, such as optical imagery, light detection and ranging (LiDAR), and synthetic aperture radar (SAR), has shown superior performance over single‑modality methods. However, these data are often acquired days or even months apart, during which various changes may occur, such as vegetation disturbances (e.g., logging and wildfires) and variations in imaging quality. Such temporal misalignments introduce cross‑modal uncertainty, especially in high‑resolution imagery, which can severely degrade segmentation accuracy. To address this challenge, we propose MURTreeFormer, a novel multi‑modal segmentation framework that mitigates and leverages aleatoric uncertainty for robust tree cover mapping. MURTreeFormer treats one modality as primary and others as auxiliary, explicitly modeling patch‑level uncertainty in the auxiliary modalities via a probabilistic latent representation. Uncertain patches are identified and reconstructed from the primary modality's distribution through a VAE‑based resampling mechanism, producing enhanced auxiliary features for fusion. In the decoder, a gradient magnitude attention (GMA) module and a lightweight refinement head (RH) are further integrated to guide attention toward tree‑like structures and to preserve fine‑grained spatial details. Extensive experiments on multi‑modal datasets from Shanghai and Zurich demonstrate that MURTreeFormer significantly improves segmentation performance and effectively reduces the impact of temporally induced aleatoric uncertainty.

Abstract:
Semantic segmentation in real‑world applications often requires not only accurate masks but also strict adherence to textual labeling guidelines. These guidelines are typically complex and long, and both human and automated labeling often fail to follow them faithfully. Traditional approaches depend on expensive task‑specific retraining that must be repeated as the guidelines evolve. Although recent open‑vocabulary segmentation methods excel with simple prompts, they often fail when confronted with sets of paragraph‑length guidelines that specify intricate segmentation rules. To address this, we introduce a multi‑agent, training‑free framework that coordinates general‑purpose vision‑language models within an iterative Worker‑Supervisor refinement architecture. The Worker performs the segmentation, the Supervisor critiques it against the retrieved guidelines, and a lightweight reinforcement learning stop policy decides when to terminate the loop, ensuring guideline‑consistent masks while balancing resource use. Evaluated on the Waymo and ReasonSeg datasets, our method notably outperforms state‑of‑the‑art baselines, demonstrating strong generalization and instruction adherence.

Abstract:
Semantic segmentation of overhead remote sensing imagery enables applications in mapping, urban planning, and disaster response. State‑of‑the‑art segmentation networks are typically developed and tuned on ground‑perspective photographs and do not directly address remote sensing challenges such as extreme scale variation, foreground‑background imbalance, and large image sizes. We explore the incorporation of the differential morphological profile (DMP), a multi‑scale shape extraction method based on grayscale morphology, into modern segmentation networks. Prior studies have shown that the DMP can provide critical shape information to Deep Neural Networks to enable superior detection and classification performance in overhead imagery. In this work, we extend prior DMPNet work beyond classification and object detection by integrating DMP features into three state‑of‑the‑art convolutional and transformer semantic segmentation architectures. We utilize both direct input, which adapts the input stem of feature extraction architectures to accept DMP channels, and hybrid architectures, a dual‑stream design that fuses RGB and DMP encoders. Using the iSAID benchmark dataset, we evaluate a variety of DMP differentials and structuring element shapes to more effectively provide shape information to the model. Our results show that while non‑DMP models generally outperform the direct‑input variants, hybrid DMP consistently outperforms direct‑input and is capable of surpassing a non‑DMP model on mIoU, F1, and Recall.
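
To make the idea of a differential morphological profile concrete, here is a simplified 1D sketch. It assumes plain grayscale openings with flat structuring elements of increasing radius and takes absolute differences between consecutive levels; the DMP literature typically uses opening and closing by reconstruction on 2D images, so this is an illustration of the principle rather than the configuration evaluated in the abstract. All function names are hypothetical.

```python
def erode(signal, radius):
    """Grayscale erosion: sliding minimum over a window of 2*radius + 1."""
    n = len(signal)
    return [min(signal[max(0, i - radius):min(n, i + radius + 1)]) for i in range(n)]


def dilate(signal, radius):
    """Grayscale dilation: sliding maximum over a window of 2*radius + 1."""
    n = len(signal)
    return [max(signal[max(0, i - radius):min(n, i + radius + 1)]) for i in range(n)]


def opening(signal, radius):
    """Opening = erosion followed by dilation; removes bright structures
    narrower than the structuring element."""
    return dilate(erode(signal, radius), radius)


def opening_dmp(signal, radii):
    """Absolute differences between consecutive opening levels."""
    profile = [list(signal)] + [opening(signal, r) for r in radii]
    return [
        [abs(a - b) for a, b in zip(prev, cur)]
        for prev, cur in zip(profile, profile[1:])
    ]


# A bright bump three samples wide on a dark background.
signal = [0, 0, 5, 5, 5, 0, 0, 0, 0]
dmp = opening_dmp(signal, radii=[1, 2])
for radius, level in zip([1, 2], dmp):
    print(radius, level)
```

The bump survives the radius-1 opening (window of 3) and is removed by the radius-2 opening (window of 5), so its response appears only in the second level. This scale-selective response is the shape cue that DMP channels hand to a segmentation network.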

Abstract:
This research addresses the growing challenge of artificial satellite trail interference in ground‑based astronomical observations by developing an efficient deep learning identification method. With the proliferation of satellite constellations in low Earth orbit, accurate detection of satellite trails has become crucial for preserving astronomical data quality. Using multi‑band photometric survey observational data from the Multi‑channel Photometric Survey Telescope (Mephisto) of Yunnan University, we constructed a specialized dataset of astronomical images containing satellite trails. We propose a novel ASA‑U‑Net model that integrates atrous spatial pyramid pooling with channel attention mechanisms into the U‑Net architecture to effectively capture sparse satellite trail features that traditional semantic segmentation models often miss during downsampling. The model was implemented and validated on actual telescope data, demonstrating superior performance in end‑to‑end detection and marking of satellite trails compared to traditional methods. This approach significantly improves data processing precision without requiring manual parameter adjustments, making it suitable for processing massive nightly survey data and enhancing the quality of astronomical data products.

Abstract:
Historic urban quarters are increasingly shaped by tourism and lifestyle consumption, yet planners often lack scalable evidence on what visitors notice, prefer, and criticize in these environments. This study proposes an AI‑based, multimodal framework to decode tourist perception by combining visual attention, color‑based aesthetic representation, and multidimensional satisfaction. We collect geotagged photos and review texts from a major Chinese platform and assemble a street view image set as a baseline for comparison across 12 historic urban quarters in Shanghai. We train a semantic segmentation model to quantify foregrounded visual elements in tourist‑shared imagery, extract and compare color palettes between social media photos and street views, and apply a multi‑task sentiment classifier to assess satisfaction across four experience dimensions that correspond to activity, physical setting, supporting services, and commercial offerings. Results show that tourist photos systematically foreground key streetscape elements and that the color composition represented on social media can differ from on‑site street views, indicating a perception‑reality gap that varies by quarter. The framework offers an interpretable and transferable approach to diagnose such gaps and to inform heritage management and visitor‑oriented urban design.

Abstract:
Close‑range laser scanning provides detailed 3D captures of forest stands but requires efficient software for processing 3D point cloud data and extracting individual trees. Although recent studies have introduced deep learning methods for tree instance segmentation, these approaches require large annotated datasets and substantial computational resources. As a resource‑efficient alternative, we present a revised version of the treeX algorithm, an unsupervised method that combines clustering‑based stem detection with region growing for crown delineation. While the original treeX algorithm was developed for personal laser scanning (PLS) data, we provide two parameter presets, one for ground‑based laser scanning (stationary terrestrial ‑ TLS and PLS), and one for UAV‑borne laser scanning (ULS). We evaluated the method on six public datasets (FOR‑instance, ForestSemantic, LAUTx, NIBIO MLS, TreeLearn, Wytham Woods) and compared it to six open‑source methods (original treeX, treeiso, RayCloudTools, ForAINet, SegmentAnyTree, TreeLearn). Compared to the original treeX algorithm, our revision reduces runtime and improves accuracy, with instance detection F_1‑score gains of +0.11 to +0.49 for ground‑based data. For ULS data, our preset achieves an F_1‑score of 0.58, whereas the original algorithm fails to segment any correct instances. For TLS and PLS data, our algorithm achieves accuracy similar to recent open‑source methods, including deep learning. Given its algorithmic design, we see two main applications for our method: (1) as a resource‑efficient alternative to deep learning approaches in scenarios where the data characteristics align with the method design (sufficient stem visibility and point density), and (2) for the semi‑automatic generation of labels for deep learning models. To enable broader adoption, we provide an open‑source Python implementation in the pointtree package.

Abstract:
Extracting small objects from remote sensing imagery plays a vital role in various applications, including urban planning, environmental monitoring, and disaster management. While current research primarily focuses on small object detection, instance segmentation for small objects remains underexplored, with no dedicated datasets available. This gap stems from the technical challenges and high costs of pixel‑level annotation for small objects. While the Segment Anything Model (SAM) demonstrates impressive zero‑shot generalization, its performance on small‑object segmentation deteriorates significantly, largely due to the coarse 1/16 feature resolution that causes severe loss of fine spatial details. To this end, we propose SOPSeg, a prompt‑based framework specifically designed for small object segmentation in remote sensing imagery. It incorporates a region‑adaptive magnification strategy to preserve fine‑grained details, and employs a customized decoder that integrates edge prediction and progressive refinement for accurate boundary delineation. Moreover, we introduce a novel prompting mechanism tailored to the oriented bounding boxes widely adopted in remote sensing applications. SOPSeg outperforms existing methods in small object segmentation and facilitates efficient dataset construction for remote sensing tasks. We further construct a comprehensive small object instance segmentation dataset based on SODA‑A, and will release both the model and dataset to support future research.

Abstract:
Acquiring high‑quality instance segmentation data is challenging due to the labor‑intensive nature of the annotation process and significant class imbalances within datasets. Recent studies have utilized the integration of Copy‑Paste and diffusion models to create more diverse datasets. However, these studies often lack deep collaboration between large language models (LLMs) and diffusion models, and underutilize the rich information within the existing training data. To address these limitations, we propose InstaDA, a novel, training‑free Dual‑Agent system designed to augment instance segmentation datasets. First, we introduce a Text‑Agent (T‑Agent) that enhances data diversity through collaboration between LLMs and diffusion models. This agent features a novel Prompt Rethink mechanism, which iteratively refines prompts based on the generated images. This process not only fosters collaboration but also increases image utilization and optimizes the prompts themselves. Additionally, we present an Image‑Agent (I‑Agent) aimed at enriching the overall data distribution. This agent augments the training set by generating new instances conditioned on the training images. To ensure practicality and efficiency, both agents operate as independent and automated workflows, enhancing usability. Experiments conducted on the LVIS 1.0 validation set indicate that InstaDA achieves significant improvements, with an increase of +4.0 in box average precision (AP) and +3.3 in mask AP compared to the baseline. Furthermore, it outperforms the leading model, DiverGen, by +0.3 in box AP and +0.1 in mask AP, with a notable +0.7 gain in box AP on common categories and mask AP gains of +0.2 on common categories and +0.5 on frequent categories.

Abstract:
Photorealistic and controllable human avatars have gained popularity in the research community thanks to rapid advances in neural rendering, providing fast and realistic synthesis tools. However, a limitation of current solutions is the presence of noticeable blurring. To solve this problem, we propose GaussianGAN, an animatable avatar approach developed for photorealistic rendering of people in real‑time. We introduce a novel Gaussian splatting densification strategy to build Gaussian points from the surface of cylindrical structures around estimated skeletal limbs. Given the camera calibration, we render an accurate semantic segmentation with our novel view segmentation module. Finally, a UNet generator uses the rendered Gaussian splatting features and the segmentation maps to create photorealistic digital avatars. Our method runs in real‑time with a rendering speed of 79 FPS. It outperforms previous methods regarding visual perception and quality, achieving state‑of‑the‑art results with a pixel fidelity of 32.94 dB on the ZJU Mocap dataset and 33.39 dB on the Thuman4 dataset.

Abstract:
High‑resolution LiDAR data plays a critical role in 3D semantic segmentation for autonomous driving, but the high cost of advanced sensors limits large‑scale deployment. In contrast, low‑cost sensors such as 16‑channel LiDAR produce sparse point clouds that degrade segmentation accuracy. To overcome this, we introduce the first end‑to‑end framework that jointly addresses LiDAR super‑resolution (SR) and semantic segmentation. The framework employs joint optimization during training, allowing the SR module to incorporate semantic cues and preserve fine details, particularly for smaller object classes. A new SR loss function further directs the network to focus on regions of interest. The proposed lightweight, model‑based SR architecture uses significantly fewer parameters than existing LiDAR SR approaches, while remaining easily compatible with segmentation networks. Experiments show that our method achieves segmentation performance comparable to models operating on high‑resolution and costly 64‑channel LiDAR data.

Abstract:
Context: Predicting human trajectories is crucial for the safety and reliability of autonomous systems, such as automated vehicles and mobile robots. However, rigorously testing the underlying multimodal Human Trajectory Prediction (HTP) models, which typically use multiple input sources (e.g., trajectory history and environment maps) and produce stochastic outputs (multiple possible future paths), presents significant challenges. The primary difficulty lies in the absence of a definitive test oracle, as numerous future trajectories might be plausible for any given scenario. Objectives: This research presents the application of Metamorphic Testing (MT) as a systematic methodology for testing multimodal HTP systems. We address the oracle problem through metamorphic relations (MRs) adapted for the complexities and stochastic nature of HTP. Methods: We present five MRs, targeting transformations of both historical trajectory data and semantic segmentation maps used as environmental context. These MRs encompass: 1) label‑preserving geometric transformations (mirroring, rotation, rescaling) applied to both trajectory and map inputs, where outputs are expected to transform correspondingly; and 2) map‑altering transformations (changing semantic class labels, introducing obstacles) with predictable changes in trajectory distributions. We propose probabilistic violation criteria based on distance metrics between probability distributions, such as the Wasserstein or Hellinger distance. Conclusion: This study introduces tool, an MT framework for the oracle‑less testing of multimodal, stochastic HTP systems. It allows for assessment of model robustness against input transformations and contextual changes without reliance on ground‑truth trajectories.
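
The mirroring relation and a distribution-based violation criterion can be sketched as follows. Everything here is an assumption made for illustration: `toy_predictor` is a hypothetical stand-in for an HTP model (constant-velocity endpoint plus Gaussian noise), the map input is omitted, and the violation score compares per-coordinate marginals with a 1D Wasserstein distance rather than any criterion defined in the paper.

```python
import random


def wasserstein_1d(a, b):
    """W1 distance between two equal-size 1D empirical samples."""
    assert len(a) == len(b)
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def mirror(points):
    """Mirror 2D points across the y-axis (x -> -x)."""
    return [(-x, y) for x, y in points]


def toy_predictor(history, n_samples, rng, x_bias=0.0):
    """Hypothetical stochastic model: constant-velocity endpoint plus noise."""
    (x0, y0), (x1, y1) = history[-2], history[-1]
    end_x, end_y = 2 * x1 - x0 + x_bias, 2 * y1 - y0
    return [
        (end_x + rng.gauss(0, 0.1), end_y + rng.gauss(0, 0.1))
        for _ in range(n_samples)
    ]


def mirror_mr_violation(predict, history, n_samples=500, seed=0):
    """Distance between mirrored source outputs and follow-up outputs."""
    rng = random.Random(seed)
    source = predict(history, n_samples, rng)
    follow_up = predict(mirror(history), n_samples, rng)
    expected = mirror(source)  # what an equivariant model should produce
    dx = wasserstein_1d([p[0] for p in expected], [p[0] for p in follow_up])
    dy = wasserstein_1d([p[1] for p in expected], [p[1] for p in follow_up])
    return max(dx, dy)


history = [(0.0, 0.0), (1.0, 0.5)]
ok = mirror_mr_violation(toy_predictor, history)
# A deliberately broken model whose x-bias does not mirror with the input.
bad = mirror_mr_violation(
    lambda h, n, r: toy_predictor(h, n, r, x_bias=0.5), history
)
print(f"equivariant model: {ok:.3f}, biased model: {bad:.3f}")
```

Because the outputs are stochastic, the score for a well-behaved model is small but not zero, so a test would flag a violation only above a chosen threshold. The biased model shifts the follow-up distribution by about one unit in x and is flagged clearly.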

Abstract:
Open‑vocabulary semantic segmentation (OVSS) conducts pixel‑level classification via text‑driven alignment, where the domain discrepancy between base category training and open‑vocabulary inference poses challenges in discriminative modeling of latent unseen categories. To address this challenge, existing vision‑language model (VLM)‑based approaches demonstrate commendable performance through pre‑trained multi‑modal representations. However, the fundamental mechanisms of latent semantic comprehension remain underexplored, making them a bottleneck for OVSS. In this work, we initiate a probing experiment to explore distribution patterns and dynamics of latent semantics in VLMs under inductive learning paradigms. Building on these insights, we propose X‑Agent, an innovative OVSS framework employing a latent semantic‑aware "agent" to orchestrate cross‑modal attention mechanisms, simultaneously optimizing latent semantic dynamics and amplifying their perceptibility. Extensive benchmark evaluations demonstrate that X‑Agent achieves state‑of‑the‑art performance while effectively enhancing the latent semantic saliency.

Abstract:
The emergence of vision language models (VLMs) bridges the gap between vision and language, enabling multimodal understanding beyond traditional visual‑only deep learning models. However, transferring VLMs from the natural image domain to remote sensing (RS) segmentation remains challenging due to the large domain gap and the diversity of RS inputs across tasks, particularly in open‑vocabulary semantic segmentation (OVSS) and referring expression segmentation (RES). Here, we propose a training‑free unified framework, termed DGL‑RSIS, which decouples visual and textual representations and performs visual‑language alignment at both local semantic and global contextual levels. Specifically, a Global‑Local Decoupling (GLD) module decomposes textual inputs into local semantic tokens and global contextual tokens, while image inputs are partitioned into class‑agnostic mask proposals. Then, a Local Visual‑Textual Alignment (LVTA) module adaptively extracts context‑aware visual features from the mask proposals and enriches textual features through knowledge‑guided prompt engineering, achieving OVSS from a local perspective. Furthermore, a Global Visual‑Textual Alignment (GVTA) module employs a global‑enhanced Grad‑CAM mechanism to capture contextual cues for referring expressions, followed by a mask selection module that integrates pixel‑level activations into mask‑level segmentation outputs, thereby achieving RES from a global perspective. Experiments on the iSAID (OVSS) and RRSIS‑D (RES) benchmarks demonstrate that DGL‑RSIS outperforms existing training‑free approaches. Ablation studies further validate the effectiveness of each module. To the best of our knowledge, this is the first unified training‑free framework for RS image segmentation, which effectively transfers the semantic capability of VLMs trained on natural images to the RS domain without additional training.

Abstract:
Class‑Incremental Semantic Segmentation (CISS) requires continuous learning of newly introduced classes while retaining knowledge of past classes. By abstracting mainstream methods into two stages (visual feature extraction and prototype‑feature matching), we identify a more fundamental challenge termed catastrophic semantic entanglement. This phenomenon involves Prototype‑Feature Entanglement caused by semantic misalignment during the incremental process, and Background‑Increment Entanglement due to dynamic data evolution. Existing techniques, which rely on visual feature learning without sufficient cues to distinguish targets, introduce significant noise and errors. To address these issues, we introduce a Language‑inspired Bootstrapped Disentanglement framework (LBD). We leverage the prior class semantics of pre‑trained visual‑language models (e.g., CLIP) to guide the model in autonomously disentangling features through Language‑guided Prototypical Disentanglement and Manifold Mutual Background Disentanglement. The former guides the disentangling of new prototypes by treating hand‑crafted text features as topological templates, while the latter employs multiple learnable prototypes and mask‑pooling‑based supervision for background‑incremental class disentanglement. By incorporating soft prompt tuning and encoder adaptation modifications, we further bridge the capability gap of CLIP between dense and sparse tasks, achieving state‑of‑the‑art performance on both Pascal VOC and ADE20k, particularly in multi‑step scenarios.

Abstract:
Semantic segmentation of 3D LiDAR data plays a pivotal role in autonomous driving. Traditional approaches rely on extensive annotated data for point cloud analysis, incurring high costs and time investments. In contrast, real‑world image datasets offer abundant availability and substantial scale. To mitigate the burden of annotating 3D LiDAR point clouds, we propose two crossmodal knowledge distillation methods: Unsupervised Domain Adaptation Knowledge Distillation (UDAKD) and Feature and Semantic‑based Knowledge Distillation (FSKD). Leveraging readily available spatio‑temporally synchronized data from cameras and LiDARs in autonomous driving scenarios, we directly apply a pretrained 2D image model to unlabeled 2D data. Through crossmodal knowledge distillation with known 2D‑3D correspondence, we actively align the output of the 3D network with the corresponding points of the 2D network, thereby obviating the necessity for 3D annotations. Our focus is on preserving modality‑general information while filtering out modality‑specific details during crossmodal distillation. To achieve this, we deploy self‑calibrated convolution on 3D point clouds as the foundation of our domain adaptation module. Rigorous experimentation validates the effectiveness of our proposed methods, consistently surpassing the performance of state‑of‑the‑art approaches in the field.
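
Aligning 3D network outputs with a 2D network through known 2D‑3D correspondence can be sketched as a per-point distillation loss. The abstract does not state the loss used, so the KL divergence below is an assumption, as are the function names and the dictionary-based `point_to_pixel` mapping.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(pixel_logits_2d, point_logits_3d, point_to_pixel):
    """Mean KL(teacher || student) over 3D points with a matched 2D pixel.

    pixel_logits_2d: class logits per pixel from the 2D teacher.
    point_logits_3d: class logits per point from the 3D student.
    point_to_pixel:  hypothetical mapping {point index: pixel index}.
    """
    losses = []
    for point_idx, pixel_idx in point_to_pixel.items():
        teacher = softmax(pixel_logits_2d[pixel_idx])
        student = softmax(point_logits_3d[point_idx])
        losses.append(kl_divergence(teacher, student))
    return sum(losses) / len(losses)


# Toy example: two pixels, two points, three classes (values are made up).
pixel_logits = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]]
point_logits = [[1.0, 1.0, 1.0], [0.2, 1.5, 0.3]]
loss = distillation_loss(pixel_logits, point_logits, {0: 0, 1: 1})
print(f"distillation loss = {loss:.4f}")
```

Only points that project onto a pixel contribute, so no 3D labels are needed; the second point already matches its teacher and adds zero loss.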

Abstract:
Recorded videos from surgeries have become an increasingly important information source for the field of medical endoscopy, since the recorded footage shows every single detail of the surgery. However, while video recording is straightforward these days, automatic content indexing ‑ the basis for content‑based search in a medical video archive ‑ is still a great challenge due to the very special video content. In this work, we investigate segmentation and recognition of surgical instruments in videos recorded from laparoscopic gynecology. More precisely, we evaluate the achievable performance of segmenting surgical instruments from their background by using a region‑based fully convolutional network for instance‑aware (1) instrument segmentation as well as (2) instrument recognition. While the first part addresses only binary segmentation of instances (i.e., distinguishing between instrument or background) we also investigate multi‑class instrument recognition (i.e., identifying the type of instrument). Our evaluation results show that even with a moderately low number of training examples, we are able to localize and segment instrument regions with a pretty high accuracy. However, the results also reveal that determining the particular instrument is still very challenging, due to the inherently high similarity of surgical instruments.

Abstract:
Localisation of surgical tools constitutes a foundational building block for computer‑assisted interventional technologies. Works in this field typically focus on training deep learning models to perform segmentation tasks. Performance of learning‑based approaches is limited by the availability of diverse annotated data. We argue that skeletal pose annotations are a more efficient annotation approach for surgical tools, striking a balance between richness of semantic information and ease of annotation, thus allowing for accelerated growth of available annotated data. To encourage adoption of this annotation style, we present ROBUST‑MIPS, a combined tool pose and tool instance segmentation dataset derived from the existing ROBUST‑MIS dataset. Our enriched dataset facilitates the joint study of these two annotation styles and allows head‑to‑head comparison on various downstream tasks. To demonstrate the adequacy of pose annotations for surgical tool localisation, we set up a simple benchmark using popular pose estimation methods and observe high‑quality results. To ease adoption, together with the dataset, we release our benchmark models and custom tool pose annotation software.

Abstract:
Visible and infrared image fusion (VIF) is an important multimedia task in computer vision. Most VIF methods focus primarily on optimizing fused image quality. Recent studies have begun incorporating downstream tasks, such as semantic segmentation and object detection, to provide semantic guidance for VIF. However, semantic segmentation requires extensive annotations, while object detection, despite reducing annotation efforts compared with segmentation, faces challenges in highly crowded scenes due to overlapping bounding boxes and occlusion. Moreover, although RGB‑T crowd counting has gained increasing attention in recent years, no studies have integrated VIF and crowd counting into a unified framework. To address these challenges, we propose FusionCounting, a novel multi‑task learning framework that integrates crowd counting into the VIF process. Crowd counting provides a direct quantitative measure of population density with minimal annotation, making it particularly suitable for dense scenes. Our framework leverages both input images and population density information in a mutually beneficial multi‑task design. To accelerate convergence and balance task contributions, we introduce a dynamic loss function weighting strategy. Furthermore, we incorporate adversarial training to enhance the robustness of both VIF and crowd counting, improving the model's stability and resilience to adversarial attacks. Experimental results on public datasets demonstrate that FusionCounting not only enhances image fusion quality but also achieves superior crowd counting performance.

Abstract:
Existing LiDAR‑based 3D object detectors typically rely on manually annotated labels for training to achieve good performance. However, obtaining high‑quality 3D labels is time‑consuming and labor‑intensive. To address this issue, recent works explore unsupervised 3D object detection by introducing RGB images as an auxiliary modality to assist pseudo‑box generation. However, these methods simply integrate pseudo‑boxes generated from LiDAR point clouds and RGB images. Such a label‑level fusion strategy brings limited improvements to the quality of pseudo‑boxes, as it overlooks the complementary nature of LiDAR and RGB image data. To overcome the above limitations, we propose a novel data‑level fusion framework that integrates RGB images and LiDAR data at an early stage. Specifically, we utilize vision foundation models for instance segmentation and depth estimation on images and introduce a bi‑directional fusion method, where real points acquire category labels from the 2D space, while 2D pixels are projected onto 3D to enhance real point density. To mitigate noise from depth and segmentation estimations, we propose a local and global filtering method, which applies local radius filtering to suppress depth estimation errors and global statistical filtering to remove segmentation‑induced outliers. Furthermore, we propose a data‑level fusion based dynamic self‑evolution strategy, which iteratively refines pseudo‑boxes under a dense representation, significantly improving localization accuracy. Extensive experiments on the nuScenes dataset demonstrate that the detector trained by our method significantly outperforms that trained by previous state‑of‑the‑art methods with 28.4% mAP on the nuScenes validation benchmark.
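
The local and global filtering step described above can be illustrated with a small sketch. This is not the authors' implementation: the function names, the neighbour-count rule, and the mean-plus-k-sigma cutoff are assumptions chosen only to show what a radius filter and a statistical filter typically look like on a point set.

```python
import math
from statistics import mean, pstdev

def radius_filter(points, radius, min_neighbors):
    """Local filter (hypothetical form): keep a point only if at least
    `min_neighbors` other points lie within `radius` of it."""
    kept = []
    for i, p in enumerate(points):
        neighbors = sum(
            1 for j, q in enumerate(points)
            if j != i and math.dist(p, q) <= radius
        )
        if neighbors >= min_neighbors:
            kept.append(p)
    return kept

def statistical_filter(points, k_sigma):
    """Global filter (hypothetical form): drop points whose distance to the
    centroid exceeds mean + k_sigma * std of all centroid distances."""
    if not points:
        return []
    centroid = tuple(mean(axis) for axis in zip(*points))
    dists = [math.dist(p, centroid) for p in points]
    cutoff = mean(dists) + k_sigma * pstdev(dists)
    return [p for p, d in zip(points, dists) if d <= cutoff]

# Three clustered points plus one isolated outlier.
cloud = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0), (5.0, 5.0, 5.0)]
locally_clean = radius_filter(cloud, radius=0.5, min_neighbors=1)
globally_clean = statistical_filter(cloud, k_sigma=1.0)
```

The brute-force neighbour search is quadratic in the number of points; a real pipeline would likely use a spatial index such as a k-d tree.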

Abstract:
Rapid and reliable qualification of advanced materials remains a bottleneck in industrial manufacturing, particularly for heterogeneous structures produced via non‑conventional additive manufacturing processes. This study introduces a novel framework that links microstructure informatics with a range of expert characterization knowledge using customized and hybrid vision‑language representations (VLRs). By integrating deep semantic segmentation with pre‑trained multi‑modal models (CLIP and FLAVA), we encode both visual microstructural data and textual expert assessments into shared representations. To overcome limitations in general‑purpose embeddings, we develop a customized similarity‑based representation that incorporates both positive and negative references from expert‑annotated images and their associated textual descriptions. This allows zero‑shot classification of previously unseen microstructures through a net similarity scoring approach. Validation on an additively manufactured metal matrix composite dataset demonstrates the framework's ability to distinguish between acceptable and defective samples across a range of characterization criteria. Comparative analysis reveals that the FLAVA model offers higher visual sensitivity, while the CLIP model provides consistent alignment with the textual criteria. Z‑score normalization adjusts raw unimodal and cross‑modal similarity scores based on their local dataset‑driven distributions, enabling more effective alignment and classification in the hybrid vision‑language framework. The proposed method enhances traceability and interpretability in qualification pipelines by enabling human‑in‑the‑loop decision‑making without task‑specific model retraining. By advancing semantic interoperability between raw data and expert knowledge, this work contributes toward scalable and domain‑adaptable qualification strategies in engineering informatics.
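
A minimal sketch of net similarity scoring followed by z-score normalization is given below. The abstract does not state the exact formula, so this assumes the net score is the mean cosine similarity to positive references minus the mean cosine similarity to negative references; the function names and toy two-dimensional embeddings are hypothetical.

```python
from statistics import mean, pstdev

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)

def net_similarity(query, positives, negatives):
    """Assumed form: mean similarity to positive references minus
    mean similarity to negative references."""
    pos = mean(cosine(query, p) for p in positives)
    neg = mean(cosine(query, n) for n in negatives)
    return pos - neg

def z_scores(values):
    """Standardize scores against the distribution they were drawn from."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Toy embeddings: the query matches the positive reference exactly.
score = net_similarity([1.0, 0.0], positives=[[1.0, 0.0]], negatives=[[0.0, 1.0]])
normalized = z_scores([1.0, 2.0, 3.0])
```

A positive net score would then suggest the sample is closer to the acceptable references than to the defective ones; where to place the decision threshold is a separate design choice not covered here.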

Abstract:
Current methods for 3D semantic segmentation propose training models with limited annotations to address the difficulty of annotating large, irregular, and unordered 3D point cloud data. They usually focus on the 3D domain only, without leveraging the complementary nature of 2D and 3D data. Besides, some methods extend original labels or generate pseudo labels to guide the training, but they often fail to fully use these labels or address the noise within them. Meanwhile, the emergence of comprehensive and adaptable foundation models has offered effective solutions for segmenting 2D data. Leveraging this advancement, we present a novel approach that maximizes the utility of sparsely available 3D annotations by incorporating segmentation masks generated by 2D foundation models. We further propagate the 2D segmentation masks into the 3D space by establishing geometric correspondences between 3D scenes and 2D views. We extend the highly sparse annotations to encompass the areas delineated by 3D masks, thereby substantially augmenting the pool of available labels. Furthermore, we apply confidence‑ and uncertainty‑based consistency regularization on augmentations of the 3D point cloud and select the reliable pseudo labels, which are further spread on the 3D masks to generate more labels. This innovative strategy bridges the gap between limited 3D annotations and the powerful capabilities of 2D foundation models, ultimately improving the performance of 3D weakly supervised segmentation.
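
The 2D-to-3D mask propagation described above relies on geometric correspondences between points and pixels. The sketch below shows one common way to obtain them, a pinhole projection, under stated assumptions: points are already in the camera coordinate frame, the intrinsics `fx`, `fy`, `cx`, `cy` are known, and occlusion is ignored. All names are hypothetical and not taken from the paper.

```python
def project_point(point, fx, fy, cx, cy):
    """Project a camera-frame 3D point to pixel coordinates (u, v).
    Returns None for points at or behind the camera plane."""
    x, y, z = point
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)

def lift_mask_labels(points, mask, fx, fy, cx, cy, ignore=-1):
    """Assign each 3D point the mask label of the pixel it projects to.
    Points that fall outside the image or behind the camera get `ignore`."""
    height, width = len(mask), len(mask[0])
    labels = []
    for p in points:
        uv = project_point(p, fx, fy, cx, cy)
        if uv is None:
            labels.append(ignore)
            continue
        col, row = int(round(uv[0])), int(round(uv[1]))
        if 0 <= row < height and 0 <= col < width:
            labels.append(mask[row][col])
        else:
            labels.append(ignore)
    return labels

# A 2x2 label mask and five points, two of which cannot be labelled.
mask = [[0, 1], [2, 3]]
points = [(0, 0, 1), (1, 0, 1), (0, 1, 1), (0, 0, -1), (5, 5, 1)]
labels = lift_mask_labels(points, mask, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
```

In a multi-view setting the same point may receive labels from several views, so some voting or confidence rule would be needed on top of this.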

Abstract:
Hyperspectral imaging (HSI) offers a transformative sensing modality for Advanced Driver Assistance Systems (ADAS) and autonomous driving (AD) applications, enabling material‑level scene understanding through fine spectral resolution beyond the capabilities of traditional RGB imaging. This paper presents the first comprehensive review of HSI for automotive applications, examining the strengths, limitations, and suitability of current HSI technologies in the context of ADAS/AD. In addition to this qualitative review, we analyze 216 commercially available HSI and multispectral imaging cameras, benchmarking them against key automotive criteria: frame rate, spatial resolution, spectral dimensionality, and compliance with AEC‑Q100 temperature standards. Our analysis reveals a significant gap between HSI's demonstrated research potential and its commercial readiness. Only four cameras meet the defined performance thresholds, and none comply with AEC‑Q100 requirements. In addition, the paper reviews recent HSI datasets and applications, including semantic segmentation for road surface classification, pedestrian separability, and adverse weather perception. Our review shows that current HSI datasets are limited in terms of scale, spectral consistency, the number of spectral channels, and environmental diversity, posing challenges for the development of perception algorithms and the adequate validation of HSI's true potential in ADAS/AD applications. This review paper establishes the current state of HSI in automotive contexts as of 2025 and outlines key research directions toward practical integration of spectral imaging in ADAS and autonomous systems.

Abstract:
Domain Generalized Semantic Segmentation (DGSS) focuses on training a model using labeled data from a source domain, with the goal of achieving robust generalization to unseen target domains during inference. A common approach to improve generalization is to augment the source domain with synthetic data generated by diffusion models (DMs). However, the generated images often contain structural or semantic defects due to training imperfections. Training segmentation models with such flawed data can lead to performance degradation and error accumulation. To address this issue, we propose to integrate inverse evolution layers (IELs) into the generative process. IELs are designed to highlight spatial discontinuities and semantic inconsistencies using Laplacian‑based priors, enabling more effective filtering of undesirable generative patterns. Based on this mechanism, we introduce IELDM, an enhanced diffusion‑based data augmentation framework that can produce higher‑quality images. Furthermore, we observe that the defect‑suppression capability of IELs can also benefit the segmentation network by suppressing artifact propagation. Based on this insight, we embed IELs into the decoder of the DGSS model and propose IELFormer to strengthen generalization capability in cross‑domain scenarios. To further strengthen the model's semantic consistency across scales, IELFormer incorporates a multi‑scale frequency fusion (MFF) module, which performs frequency‑domain analysis to achieve structured integration of multi‑resolution features, thereby improving cross‑scale coherence. Extensive experiments on benchmark datasets demonstrate that our approach achieves superior generalization performance compared to existing methods.

Abstract:
Recent video foundation models such as SAM2 excel at prompted video segmentation by treating masks as a general‑purpose primitive. However, many real‑world settings require unprompted segmentation that aims to detect and track all objects in a video without external cues, leaving today's landscape fragmented across task‑specific models and pipelines. We recast streaming video segmentation as sequential mask prediction, analogous to language modeling, and introduce the Autoregressive Universal Segmentation Model (AUSM), a single architecture that unifies both prompted and unprompted video segmentation. Built on recent state‑space models, AUSM maintains a fixed‑size spatial state and scales to video streams of arbitrary length. Furthermore, all components of AUSM are designed for parallel training across frames, yielding substantial speedups over iterative training. On standard benchmarks (DAVIS17, YouTube‑VOS 2018 & 2019, MOSE, YouTube‑VIS 2019 & 2021, and OVIS) AUSM outperforms prior universal streaming video segmentation methods and achieves up to 2.5x faster training on 16‑frame sequences.

Abstract:
Obtaining pixel‑level annotations over large spatial extents remains a major bottleneck for deploying machine learning in ecological applications. Here we present a multi‑scale weakly supervised semantic segmentation (WSSS) framework that enables training high‑resolution segmentation models from dense, classification‑based outputs. Our method combines fine‑scale, multi‑label predictions from underwater imagery with broad‑coverage aerial data. We convert these point‑level classifications into coarse supervision masks that can be used to train a semantic segmentation model on Unmanned Aerial Vehicle (UAV) orthophotos. A second training step using the model's own refined predictions is then used to further improve spatial accuracy without requiring additional annotations. We demonstrate the approach on coral reef imagery, enabling large‑area segmentation of coral morphotypes and illustrating its flexibility in integrating new classes. The final model achieves 86.07% pixel accuracy and 52.23% mean Intersection over Union (mIoU) on manually annotated reef zones, demonstrating that accurate large‑scale coral segmentation can be obtained without pixel‑level annotations. By bridging image classification and segmentation across scales and modalities, this method provides an efficient solution for deploying segmentation models in settings where annotations are unavailable and opens opportunities for scalable, efficient monitoring in ecology and beyond.
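
For readers unfamiliar with the two reported metrics, the following sketch computes pixel accuracy and mean Intersection over Union from a confusion matrix. It follows the standard definitions; skipping classes that appear in neither the ground truth nor the predictions is an assumption, since the abstract does not say how such classes are handled.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """cm[t][p] counts pixels with ground-truth class t predicted as p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

def pixel_accuracy(cm):
    """Fraction of pixels whose predicted class equals the ground truth."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total

def mean_iou(cm):
    """Average of per-class TP / (TP + FP + FN), skipping absent classes."""
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(n)) - tp
        fn = sum(cm[c]) - tp
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return sum(ious) / len(ious)

# Four flattened pixels, two classes.
cm = confusion_matrix([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
acc = pixel_accuracy(cm)
miou = mean_iou(cm)
```

The gap between the two numbers in the abstract (86.07% versus 52.23%) is typical when a few large classes dominate the pixel count, because mIoU weights every class equally.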

Abstract:
Achieving realistic hair strand synthesis is essential for creating lifelike digital humans, but producing high‑fidelity hair strand geometry remains a significant challenge. Existing methods require a complex setup for data acquisition, involving multi‑view images captured in constrained studio environments. Additionally, these methods have longer hair volume estimation and strand synthesis times, which hinder efficiency. We introduce PanoHair, a model that estimates head geometry as signed distance fields using knowledge distillation from a pre‑trained generative teacher model for head synthesis. Our approach enables the prediction of semantic segmentation masks and 3D orientations specifically for the hair region of the estimated geometry. Our method is generative and can generate diverse hairstyles with latent space manipulations. For real images, our approach involves an inversion process to infer latent codes and produces visually appealing hair strands, offering a streamlined alternative to complex multi‑view data acquisition setups. Given the latent code, PanoHair generates a clean manifold mesh for the hair region in under 5 seconds, along with semantic and orientation maps, marking a significant improvement over existing methods, as demonstrated in our experiments.

Abstract:
Large annotated datasets are vital for training segmentation models, but pixel‑level labeling is time‑consuming, error‑prone, and often requires scarce expert annotators, especially in medical imaging. In contrast, coarse annotations are quicker, cheaper, and easier to produce, even by non‑experts. In this paper, we propose to use coarse drawings from both positive (target) and negative (background) classes in the image, even with noisy pixels, to train a convolutional neural network (CNN) for semantic segmentation. We present a method for learning the true segmentation label distributions from purely noisy coarse annotations using two coupled CNNs. The separation of the two CNNs is achieved by high fidelity with the characters of the noisy training annotations. We propose to add a complementary label learning that encourages estimating negative label distribution. To illustrate the properties of our method, we first use a toy segmentation dataset based on MNIST. We then present the quantitative results of experiments using publicly available datasets: Cityscapes dataset for multi‑class segmentation, and retinal images for medical applications. In all experiments, our method outperforms state‑of‑the‑art methods, particularly in the cases where the ratio of coarse annotations is small compared to the given dense annotations.

Abstract:
Camouflaged Object Segmentation (COS) poses a significant challenge due to the intrinsic high similarity between targets and backgrounds, demanding models capable of profound holistic understanding beyond superficial cues. Prevailing methods, often limited by shallow feature representation, inadequate reasoning mechanisms, and weak cross‑modal integration, struggle to achieve this depth of cognition, resulting in prevalent issues like incomplete target separation and imprecise segmentation. Inspired by the perceptual strategy of the Hundred‑eyed Giant (emphasizing holistic observation, omnidirectional focus, and intensive scrutiny), we introduce ArgusCogito, a novel zero‑shot, chain‑of‑thought framework underpinned by cross‑modal synergy and omnidirectional reasoning within Vision‑Language Models (VLMs). ArgusCogito orchestrates three cognitively‑inspired stages: (1) Conjecture: Constructs a strong cognitive prior through global reasoning with cross‑modal fusion (RGB, depth, semantic maps), enabling holistic scene understanding and enhanced target‑background disambiguation. (2) Focus: Performs omnidirectional, attention‑driven scanning and focused reasoning, guided by semantic priors from Conjecture, enabling precise target localization and region‑of‑interest refinement. (3) Sculpting: Progressively sculpts high‑fidelity segmentation masks by integrating cross‑modal information and iteratively generating dense positive/negative point prompts within focused regions, emulating Argus' intensive scrutiny. Extensive evaluations on four challenging COS benchmarks and three Medical Image Segmentation (MIS) benchmarks demonstrate that ArgusCogito achieves state‑of‑the‑art (SOTA) performance, validating the framework's exceptional efficacy, superior generalization capability, and robustness.

Abstract:
This paper introduces a holistic perception system for internal and external monitoring of autonomous vehicles, with the aim of demonstrating a novel AI‑leveraged self‑adaptive framework of advanced vehicle technologies and solutions that optimize perception and experience on‑board. The internal monitoring system relies on a multi‑camera setup designed for predicting and identifying driver and occupant behavior through facial recognition, and additionally exploits a large language model as a virtual assistant. Moreover, the in‑cabin monitoring system includes AI‑empowered smart sensors that measure air quality and perform thermal comfort analysis for efficient on‑ and off‑boarding. The external monitoring system, in turn, perceives the vehicle's surrounding environment through a cost‑efficient LiDAR‑based semantic segmentation approach that performs highly accurate and efficient super‑resolution on low‑quality raw 3D point clouds. The holistic perception framework is developed in the context of the EU's Horizon Europe programme AutoTRUST, and has been integrated and deployed on a real electric vehicle provided by ALKE. Experimental validation and evaluation at the integration site of the Joint Research Centre at Ispra, Italy, highlight the increased performance and efficiency of the modular blocks of the proposed perception architecture.

Abstract:
Cross‑client data heterogeneity in federated learning induces biases that impede unbiased consensus condensation and the complementary fusion of generalization‑ and personalization‑oriented knowledge. While existing approaches mitigate heterogeneity through model decoupling and representation center loss, they often rely on static and restricted metrics to evaluate local knowledge and adopt global alignment too rigidly, leading to consensus distortion and diminished model adaptability. To address these limitations, we propose FedMate, a method that implements bilateral optimization: On the server side, we construct a dynamic global prototype, with aggregation weights calibrated by holistic integration of sample size, current parameters, and future prediction; a category‑wise classifier is then fine‑tuned using this prototype to preserve global consistency. On the client side, we introduce complementary classification fusion to enable merit‑based discrimination training and incorporate cost‑aware feature transmission to balance model performance and communication efficiency. Experiments on five datasets of varying complexity demonstrate that FedMate outperforms state‑of‑the‑art methods in harmonizing generalization and adaptation. Additionally, semantic segmentation experiments on autonomous driving datasets validate the method's real‑world scalability.

Abstract:
Recently, detection of label errors and improvement of label quality in datasets for supervised learning tasks has become an increasingly important goal in both research and industry. The consequences of incorrectly annotated data include reduced model performance, biased benchmark results, and lower overall accuracy. Current state‑of‑the‑art label error detection methods often focus on a single computer vision task and, consequently, a specific type of dataset, containing, for example, either bounding boxes or pixel‑wise annotations. Furthermore, previous methods are not learning‑based. In this work, we overcome this research gap. We present a unified method for detecting label errors in object detection, semantic segmentation, and instance segmentation datasets. In a nutshell, our approach ‑ learning to detect label errors by making them ‑ works as follows: we inject different kinds of label errors into the ground truth. Then, the detection of label errors, across all mentioned primary tasks, is framed as an instance segmentation problem based on a composite input. In our experiments, we compare the label error detection performance of our method with various baselines and state‑of‑the‑art approaches of each task's domain on simulated label errors across multiple tasks, datasets, and base models. This is complemented by a generalization study on real‑world label errors. Additionally, we release 459 real label errors identified in the Cityscapes dataset and provide a benchmark for real label error detection in Cityscapes.

Abstract:
We introduce ISALux, a novel transformer‑based approach for Low‑Light Image Enhancement (LLIE) that seamlessly integrates illumination and semantic priors. Our architecture includes an original self‑attention block, Hybrid Illumination and Semantics‑Aware Multi‑Headed Self‑Attention (HISA‑MSA), which integrates illumination and semantic segmentation maps for enhanced feature extraction. ISALux employs two self‑attention modules to independently process illumination and semantic features, selectively enriching each other to regulate luminance and highlight structural variations in real‑world scenarios. A Mixture of Experts (MoE)‑based Feed‑Forward Network (FFN) enhances contextual learning, with a gating mechanism conditionally activating the top K experts for specialized processing. To address overfitting in LLIE methods caused by distinct light patterns in benchmarking datasets, we enhance the HISA‑MSA module with low‑rank matrix adaptations (LoRA). Extensive qualitative and quantitative evaluations across multiple specialized datasets demonstrate that ISALux is competitive with state‑of‑the‑art (SOTA) methods. Additionally, an ablation study highlights the contribution of each component in the proposed model. Code will be released upon publication.
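
The top-K gating idea mentioned above can be sketched in a few lines. This is a generic Mixture of Experts illustration, not the ISALux code: it assumes the gate renormalizes with a softmax over only the K selected logits, and it uses scalar functions as stand-ins for expert networks.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_gate(logits, k):
    """Pick the k largest gate logits and renormalize them with softmax.
    Returns {expert_index: weight}; all other experts are inactive."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    weights = softmax([logits[i] for i in top])
    return dict(zip(top, weights))

def moe_forward(x, experts, gate_logits, k):
    """Weighted sum of the outputs of the k active experts only."""
    gates = top_k_gate(gate_logits, k)
    return sum(w * experts[i](x) for i, w in gates.items())

# Four toy scalar experts; only the two with the highest logits run.
experts = [lambda x: x, lambda x: 2 * x, lambda x: 4 * x, lambda x: 0 * x]
gate_logits = [0.0, 2.0, 2.0, -1.0]
gates = top_k_gate(gate_logits, k=2)
output = moe_forward(3.0, experts, gate_logits, k=2)
```

Because inactive experts are never called, compute per token scales with K rather than with the total number of experts.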

Abstract:
Training AI models to understand images without costly labeled data remains a challenge. We combine two techniques‑‑DINO (teacher‑student learning) and Barlow Twins (redundancy reduction)‑‑to create a model that learns better with fewer labels and less compute. While both DINO and Barlow Twins have independently demonstrated strong performance in self‑supervised learning, each comes with limitations‑‑DINO may be sensitive to certain augmentations, and Barlow Twins often requires batch sizes too large to fit on consumer hardware. By combining the redundancy‑reduction objective of Barlow Twins with the self‑distillation strategy of DINO, we aim to leverage their complementary strengths. We train a hybrid model on the MS COCO dataset using only 10% of labeled data for linear probing, and evaluate its performance against standalone DINO and Barlow Twins implementations. Preliminary results show that the combined approach achieves comparable loss and classification accuracy to DINO while maintaining strong feature representations. Attention visualizations further suggest improved semantic segmentation capability in the hybrid model. This combined method offers a scalable, label‑efficient alternative for training ViTs in resource‑constrained environments.
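
As a reference point for the redundancy-reduction objective named above, here is a small sketch of the standard Barlow Twins loss: the cross-correlation matrix between two batch-standardized embedding views is pushed toward the identity. How this term is weighted against the DINO self-distillation loss in the hybrid model is not stated in the abstract and is not modelled here; the default `lambd` and the `eps` guard are assumptions.

```python
from statistics import mean, pstdev

def _standardize_columns(z, eps=1e-12):
    """Return one list per feature, standardized over the batch dimension."""
    columns = []
    for col in zip(*z):
        mu, sd = mean(col), pstdev(col)
        columns.append([(v - mu) / (sd + eps) for v in col])
    return columns

def barlow_twins_loss(z_a, z_b, lambd=5e-3):
    """Sum over the cross-correlation matrix C of (1 - C_ii)^2 on the
    diagonal plus lambd * C_ij^2 off the diagonal."""
    n = len(z_a)
    a, b = _standardize_columns(z_a), _standardize_columns(z_b)
    loss = 0.0
    for i, col_a in enumerate(a):
        for j, col_b in enumerate(b):
            c = sum(x * y for x, y in zip(col_a, col_b)) / n
            loss += (1.0 - c) ** 2 if i == j else lambd * c ** 2
    return loss

# Two decorrelated features; view B is either identical or feature-swapped.
view_a = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
swapped = [[row[1], row[0]] for row in view_a]
loss_identical = barlow_twins_loss(view_a, view_a)
loss_swapped = barlow_twins_loss(view_a, swapped)
```

Identical views with decorrelated features give a loss near zero, while swapping the feature order puts all correlation off the diagonal and is penalized by both terms.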

Abstract:
Weakly Supervised Semantic Segmentation (WSSS) with image‑level labels has gained attention for its cost‑effectiveness. Most existing methods emphasize inter‑class separation, often neglecting the shared semantics among related categories and lacking fine‑grained discrimination. To address this, we propose Contrastive Prompt Clustering (CPC), a novel WSSS framework. CPC exploits Large Language Models (LLMs) to derive category clusters that encode intrinsic inter‑class relationships, and further introduces a class‑aware patch‑level contrastive loss to enforce intra‑class consistency and inter‑class separation. This hierarchical design leverages clusters as coarse‑grained semantic priors while preserving fine‑grained boundaries, thereby reducing confusion among visually similar categories. Experiments on PASCAL VOC 2012 and MS COCO 2014 demonstrate that CPC surpasses existing state‑of‑the‑art methods in WSSS.

Abstract:
Quantitative microstructural characterization is fundamental to materials science, where electron micrographs (EMs) provide indispensable high‑resolution insights. However, progress in deep learning‑based EM characterization has been hampered by the scarcity of large‑scale, diverse, and expert‑annotated datasets, due to acquisition costs, privacy concerns, and annotation complexity. To address this issue, we introduce UniEM‑3M, the first large‑scale and multimodal EM dataset for instance‑level understanding. It comprises 5,091 high‑resolution EMs, about 3 million instance segmentation labels, and image‑level attribute‑disentangled textual descriptions, a subset of which will be made publicly available. Furthermore, we are also releasing a text‑to‑image diffusion model trained on the entire collection to serve as both a powerful data augmentation tool and a proxy for the complete data distribution. To establish a rigorous benchmark, we evaluate various representative instance segmentation methods on the complete UniEM‑3M and present UniEM‑Net as a strong baseline model. Quantitative experiments demonstrate that this flow‑based model outperforms other advanced methods on this challenging benchmark. Our multifaceted release of a partial dataset, a generative model, and a comprehensive benchmark ‑‑ available at huggingface ‑‑ will significantly accelerate progress in automated materials analysis.

Abstract:
Recent advancements in computer vision and deep learning have enhanced disaster‑response capabilities, particularly in the rapid assessment of earthquake‑affected urban environments. Timely identification of accessible entry points and structural obstacles is essential for effective search‑and‑rescue (SAR) operations. To address this need, we introduce DRespNeT, a high‑resolution dataset specifically developed for aerial instance segmentation of post‑earthquake structural environments. Unlike existing datasets, which rely heavily on satellite imagery or coarse semantic labeling, DRespNeT provides detailed polygon‑level instance segmentation annotations derived from high‑definition (1080p) aerial footage captured in disaster zones, including the 2023 Turkiye earthquake and other impacted regions. The dataset comprises 28 operationally critical classes, including structurally compromised buildings, access points such as doors, windows, and gaps, multiple debris levels, rescue personnel, vehicles, and civilian visibility. A distinctive feature of DRespNeT is its fine‑grained annotation detail, enabling differentiation between accessible and obstructed areas, thereby improving operational planning and response efficiency. Performance evaluations using YOLO‑based instance segmentation models, specifically YOLOv8‑seg, demonstrate significant gains in real‑time situational awareness and decision‑making. Our optimized YOLOv8‑DRN model achieves 92.7% mAP50 with an inference speed of 27 FPS on an RTX‑4090 GPU for multi‑target detection, meeting real‑time operational requirements. The dataset and models support SAR teams and robotic systems, providing a foundation for enhancing human‑robot collaboration, streamlining emergency response, and improving survivor outcomes.

Abstract:
This manuscript presents a series of my selected contributions to the topic of label‑efficient learning in computer vision and remote sensing. The central focus of this research is to develop and adapt methods that can learn effectively from limited or partially annotated data, and can leverage abundant unlabeled data in real‑world applications. The contributions span both methodological developments and domain‑specific adaptations, in particular addressing challenges unique to Earth observation data such as multi‑modality, spatial resolution variability, and scene heterogeneity. The manuscript is organized around four main axes including (1) weakly supervised learning for object discovery and detection based on anomaly‑aware representations learned from large amounts of background images; (2) multi‑task learning that jointly trains on multiple datasets with disjoint annotations to improve performance on object detection and semantic segmentation; (3) self‑supervised and supervised contrastive learning with multimodal data to enhance scene classification in remote sensing; and (4) few‑shot learning for hierarchical scene classification using both explicit and implicit modeling of class hierarchies. These contributions are supported by extensive experimental results across natural and remote sensing datasets, reflecting the outcomes of several collaborative research projects. The manuscript concludes by outlining ongoing and future research directions focused on scaling and enhancing label‑efficient learning for real‑world applications.

Abstract:
Purpose: Recent developments in computational pathology have been driven by advances in Vision Foundation Models, particularly the Segment Anything Model (SAM). This model facilitates nuclei segmentation through two primary methods: prompt‑based zero‑shot segmentation and the use of cell‑specific SAM models for direct segmentation. These approaches enable effective segmentation across a range of nuclei and cells. However, general vision foundation models often face challenges with fine‑grained semantic segmentation, such as identifying specific nuclei subtypes or particular cells. Approach: In this paper, we propose the molecular‑empowered All‑in‑SAM Model to advance computational pathology by leveraging the capabilities of vision foundation models. This model incorporates a full‑stack approach, focusing on: (1) annotation: engaging lay annotators through molecular‑empowered learning to reduce the need for detailed pixel‑level annotations; (2) learning: adapting the SAM model to emphasize specific semantics, which utilizes its strong generalizability with a SAM adapter; and (3) refinement: enhancing segmentation accuracy by integrating Molecular‑Oriented Corrective Learning (MOCL). Results: Experimental results from both in‑house and public datasets show that the All‑in‑SAM model significantly improves cell classification performance, even when faced with varying annotation quality. Conclusions: Our approach not only reduces the workload for annotators but also extends the accessibility of precise biomedical image analysis to resource‑limited settings, thereby advancing medical diagnostics and automating pathology image analysis.

Abstract:
Tree instance segmentation of airborne laser scanning (ALS) data is of utmost importance for forest monitoring, but remains challenging due to variations in the data caused by factors such as sensor resolution, vegetation state at acquisition time, and terrain characteristics. Moreover, obtaining a sufficient amount of precisely labeled data to train fully supervised instance segmentation methods is expensive. To address these challenges, we propose a weakly supervised approach where labels of an initial segmentation result obtained either by a non‑finetuned model or a closed form algorithm are provided as a quality rating by a human operator. The labels produced during the quality assessment are then used to train a rating model, whose task is to classify a segmentation output into the same classes as specified by the human operator. Finally, the segmentation model is finetuned using feedback from the rating model. This in turn improves the original segmentation model by 34% in terms of correctly identified tree instances while considerably reducing the number of non‑tree instances predicted. Challenges remain for data over sparsely forested regions characterized by small trees (less than two meters in height) and for complex surroundings containing shrubs, boulders, and other objects that can be mistaken for trees; in these cases the performance of the proposed method is reduced.

Abstract:
Vision foundation models (VFMs) are predominantly developed using data‑centric methods. These methods require training on vast amounts of data usually with high‑quality labels, which poses a bottleneck for most institutions that lack both large‑scale data and high‑end GPUs. On the other hand, many open‑source vision models have been pretrained on domain‑specific data, enabling them to distill and represent core knowledge in a form that is transferable across diverse applications. Even though these models are highly valuable assets, they remain largely under‑explored in empowering the development of a general‑purpose VFM. In this paper, we present a new model‑driven approach for training VFMs through joint knowledge transfer and preservation. Our method unifies multiple pre‑trained teacher models in a shared latent space to mitigate the "imbalanced transfer" issue caused by their distributional gaps. Besides, we introduce a knowledge preservation strategy to take a general‑purpose teacher as a knowledge base for integrating knowledge from the remaining purpose‑specific teachers using an adapter module. By unifying and aggregating existing models, we build a powerful VFM to inherit teachers' expertise without needing to train on a large amount of labeled data. Our model not only provides generalizable visual features, but also inherently supports multiple downstream tasks. Extensive experiments demonstrate that our VFM outperforms existing data‑centric models across four fundamental vision tasks, including image classification, object detection, semantic and instance segmentation.

Abstract:
Segmentation in dense visual scenes poses significant challenges due to occlusions, background clutter, and scale variations. To address this, we introduce PerSense, an end‑to‑end, training‑free, and model‑agnostic one‑shot framework for Personalized instance Segmentation in dense images. PerSense employs a novel Instance Detection Module (IDM) that leverages density maps (DMs) to generate instance‑level candidate point prompts, followed by a Point Prompt Selection Module (PPSM) that filters false positives via adaptive thresholding and spatial gating. A feedback mechanism further enhances segmentation by automatically selecting effective exemplars to improve DM quality. We additionally present PerSense++, an enhanced variant that incorporates three additional components to improve robustness in cluttered scenes: (i) a diversity‑aware exemplar selection strategy that leverages feature and scale diversity for better DM generation; (ii) a hybrid IDM combining contour and peak‑based prompt generation for improved instance separation within complex density patterns; and (iii) an Irrelevant Mask Rejection Module (IMRM) that discards spatially inconsistent masks using outlier analysis. Finally, to support this underexplored task, we introduce PerSense‑D, a dedicated benchmark for personalized segmentation in dense images. Extensive experiments across multiple benchmarks demonstrate that PerSense++ outperforms existing methods in dense settings.

Abstract:
Powered by advances in multiple remote sensing sensors, the production of high spatial resolution images provides great potential to achieve cost‑efficient and high‑accuracy agricultural inventory and analysis in an automated way. Many studies that aim to provide an inventory at the level of individual agricultural parcels have produced numerous methods for Agricultural Parcel and Boundary Delineation (APBD). This review covers APBD methods for detecting and delineating agricultural parcels and systematically reviews the past and present of APBD‑related research applied to remote sensing images. To provide a clear knowledge map of existing APBD efforts, we conduct a comprehensive review of recent APBD papers to build a meta‑data analysis, including the algorithm, the study site, the crop type, the sensor type, the evaluation method, etc. We categorize the methods into three classes: (1) traditional image processing methods (including pixel‑based, edge‑based and region‑based); (2) traditional machine learning methods (such as random forest, decision tree); and (3) deep learning‑based methods. Because deep learning‑oriented approaches account for the majority, we further discuss deep learning‑based methods such as semantic segmentation‑based, object detection‑based and Transformer‑based methods. In addition, we discuss five APBD‑related issues to further comprehend the APBD domain using remote sensing data, such as multi‑sensor data in the APBD task, comparisons between single‑task learning and multi‑task learning in the APBD domain, comparisons among different algorithms and different APBD tasks, etc. Finally, this review proposes some APBD‑related applications and a few exciting prospects and potential hot topics in future APBD research. We hope this review helps researchers involved in the APBD domain keep track of its development and trends.

Abstract:
This paper proposes a high‑precision semantic segmentation method based on an improved TransUNet architecture to address the challenges of complex lesion structures, blurred boundaries, and significant scale variations in skin lesion images. The method integrates a transformer module into the traditional encoder‑decoder framework to model global semantic information, while retaining a convolutional branch to preserve local texture and edge features. This enhances the model's ability to perceive fine‑grained structures. A boundary‑guided attention mechanism and multi‑scale upsampling path are also designed to improve lesion boundary localization and segmentation consistency. To verify the effectiveness of the approach, a series of experiments were conducted, including comparative studies, hyperparameter sensitivity analysis, data augmentation effects, input resolution variation, and training data split ratio tests. Experimental results show that the proposed model outperforms existing representative methods in mIoU, mDice, and mAcc, demonstrating stronger lesion recognition accuracy and robustness. In particular, the model achieves better boundary reconstruction and structural recovery in complex scenarios, making it well‑suited for the key demands of automated segmentation tasks in skin lesion analysis.

Abstract:
Weed management represents a critical challenge in agriculture, significantly impacting crop yields and requiring substantial resources for control. Effective weed monitoring and analysis strategies are crucial for implementing sustainable agricultural practices and site‑specific management approaches. We introduce WeedSense, a novel multi‑task learning architecture for comprehensive weed analysis that jointly performs semantic segmentation, height estimation, and growth stage classification. We present a unique dataset capturing 16 weed species over an 11‑week growth cycle with pixel‑level annotations, height measurements, and temporal labels. WeedSense leverages a dual‑path encoder incorporating Universal Inverted Bottleneck blocks and a Multi‑Task Bifurcated Decoder with transformer‑based feature fusion to generate multi‑scale features and enable simultaneous prediction across multiple tasks. WeedSense outperforms other state‑of‑the‑art models on our comprehensive evaluation. On our multi‑task dataset, WeedSense achieves an mIoU of 89.78% for segmentation, an MAE of 1.67 cm for height estimation, and 99.99% accuracy for growth stage classification while maintaining real‑time inference at 160 FPS. Our multi‑task approach achieves 3× faster inference than sequential single‑task execution and uses 32.4% fewer parameters. Please see our project page at weedsense.github.io.

Abstract:
Test‑time adaptation enables models to adapt to evolving domains. However, balancing the tradeoff between preserving knowledge and adapting to domain shifts remains challenging for model adaptation methods, since adapting to domain shifts can induce forgetting of task‑relevant knowledge. To address this problem, we propose FOCUS, a novel frequency‑based conditioning approach within a diffusion‑driven input‑adaptation framework. Utilising learned, spatially adaptive frequency priors, our approach conditions the reverse steps during diffusion‑driven denoising to preserve task‑relevant semantic information for dense prediction. FOCUS leverages a trained, lightweight, Y‑shaped Frequency Prediction Network (Y‑FPN) that disentangles high and low frequency information from noisy images. This minimizes the computational costs involved in implementing our approach in a diffusion‑driven framework. We train Y‑FPN with FrequencyMix, a novel data augmentation method that perturbs the images across diverse frequency bands, which improves the robustness of our approach to diverse corruptions. We demonstrate the effectiveness of FOCUS for semantic segmentation and monocular depth estimation across 15 corruption types and three datasets, achieving state‑of‑the‑art averaged performance. In addition to improving standalone performance, FOCUS complements existing model adaptation methods since we can derive pseudo labels from FOCUS‑denoised images for additional supervision. Even under limited, intermittent supervision with the pseudo labels derived from the FOCUS denoised images, we show that FOCUS mitigates catastrophic forgetting for recent model adaptation methods.

Abstract:
Autonomous Driving (AD) systems exhibit markedly degraded performance under adverse environmental conditions, such as low illumination and precipitation. The underrepresentation of adverse conditions in AD datasets makes it challenging to address this deficiency. To circumvent the prohibitive cost of acquiring and annotating adverse weather data, we propose a novel Domain Adaptation (DA) pipeline that transforms clear‑weather images into fog, rain, snow, and nighttime images. Here, we systematically develop and evaluate several novel data‑generation pipelines, including simulation‑only, GAN‑based, and hybrid diffusion‑GAN approaches, to synthesize photorealistic adverse images from labelled clear images. We leverage an existing DA GAN, extend it to support auxiliary inputs, and develop a novel training recipe that leverages both simulated and real images. The simulated images facilitate exact supervision by providing perfectly matched image pairs, while the real images help bridge the simulation‑to‑real (sim2real) gap. We further introduce a method to mitigate hallucinations and artifacts in Stable‑Diffusion Image‑to‑Image (img2img) outputs by blending them adaptively with their progenitor images. We finetune downstream models on our synthetic data and evaluate them on the Adverse Conditions Dataset with Correspondences (ACDC). We achieve 1.85 percent overall improvement in semantic segmentation, and 4.62 percent on nighttime, demonstrating the efficacy of our hybrid method for robust AD perception under challenging conditions.

Abstract:
Referring Video Object Segmentation (RVOS) aims to segment specific objects in a video according to textual descriptions. We observe that recent RVOS approaches often place excessive emphasis on feature extraction and temporal modeling, while relatively neglecting the design of the segmentation head. In fact, there remains considerable room for improvement in segmentation head design. To address this, we propose a Temporal‑Conditional Referring Video Object Segmentation model, which innovatively integrates existing segmentation methods to effectively enhance boundary segmentation capability. Furthermore, our model leverages a text‑to‑video diffusion model for feature extraction. On top of this, we remove the traditional noise prediction module to avoid the randomness of noise from degrading segmentation accuracy, thereby simplifying the model while improving performance. Finally, to overcome the limited feature extraction capability of the VAE, we design a Temporal Context Mask Refinement (TCMR) module, which significantly improves segmentation quality without introducing complex designs. We evaluate our method on four public RVOS benchmarks, where it consistently achieves state‑of‑the‑art performance.

Abstract:
This report presents an overview of the AIM 2025 RipSeg Challenge, a competition designed to advance techniques for automatic rip current segmentation in still images. Rip currents are dangerous, fast‑moving flows that pose a major risk to beach safety worldwide, making accurate visual detection an important and underexplored research task. The challenge builds on RipVIS, the largest available rip current dataset, and focuses on single‑class instance segmentation, where precise delineation is critical to fully capture the extent of rip currents. The dataset spans diverse locations, rip current types, and camera orientations, providing a realistic and challenging benchmark. In total, 75 participants registered for this first edition, resulting in 5 valid test submissions. Teams were evaluated on a composite score combining F_1, F_2, AP_50, and AP_[50:95], ensuring robust and application‑relevant rankings. The top‑performing methods leveraged deep learning architectures, domain adaptation techniques, pretrained models, and domain generalization strategies to improve performance under diverse conditions. This report outlines the dataset details, competition framework, evaluation metrics, and final results, providing insights into the current state of rip current segmentation. We conclude with a discussion of key challenges, lessons learned from the submissions, and future directions for expanding RipSeg.
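
The F_1 and F_2 terms of the composite score are instances of the standard F_beta measure, which can be sketched as below. This is only an illustration of the general formula: the function name `f_beta` is hypothetical, and the way the challenge weights the four metrics into one score is not stated in the abstract and is not assumed here.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Standard F-beta score: a weighted harmonic mean of precision and recall.

    beta = 1 weights both equally (F_1); beta = 2 weights recall more (F_2).
    """
    if precision <= 0.0 and recall <= 0.0:
        return 0.0  # avoid division by zero when nothing was matched
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Same precision/recall pair, scored with beta = 1 and beta = 2.
    print(f_beta(0.5, 1.0, beta=1.0))
    print(f_beta(0.5, 1.0, beta=2.0))
```

Because F_2 weights recall more heavily than precision, a method that misses part of a rip current is penalized more than one that slightly over-segments it.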

Abstract:
Autonomous driving platforms encounter diverse driving scenarios, each with varying hardware resources and precision requirements. Given the computational limitations of embedded devices, it is crucial to consider computing costs when deploying on target platforms like the NVIDIA® DRIVE PX 2. Our objective is to customize the semantic segmentation network according to the computing power and specific scenarios of autonomous driving hardware. We implement dynamic adaptability through a three‑tier control mechanism ‑‑ width multiplier, classifier depth, and classifier kernel ‑‑ allowing fine‑grained control over model components based on hardware constraints and task requirements. This adaptability facilitates broad model scaling, targeted refinement of the final layers, and scenario‑specific optimization of kernel sizes, leading to improved resource allocation and performance. Additionally, we leverage Bayesian Optimization with surrogate modeling to efficiently explore hyperparameter spaces under tight computational budgets. Our approach addresses scenario‑specific and task‑specific requirements through automatic parameter search, accommodating the unique computational complexity and accuracy needs of autonomous driving. It scales its Multiply‑Accumulate Operations (MACs) for Task‑Specific Learning Adaptation (TSLA), resulting in alternative configurations tailored to diverse self‑driving tasks. These TSLA customizations maximize computational capacity and model accuracy, optimizing hardware utilization.

Abstract:
Reconstructing dynamic driving scenes from dashcam videos has attracted increasing attention due to its significance in autonomous driving and scene understanding. While recent advances have made impressive progress, most methods still unify all background elements into a single representation, hindering both instance‑level understanding and flexible scene editing. Some approaches attempt to lift 2D segmentation into 3D space, but often rely on pre‑processed instance IDs or complex pipelines to map continuous features to discrete identities. Moreover, these methods are typically designed for indoor scenes with rich viewpoints, making them less applicable to outdoor driving scenarios. In this paper, we present InstDrive, an instance‑aware 3D Gaussian Splatting framework tailored for the interactive reconstruction of dynamic driving scenes. We use masks generated by SAM as pseudo ground‑truth to guide 2D feature learning via contrastive loss and pseudo‑supervised objectives. At the 3D level, we introduce regularization to implicitly encode instance identities and enforce consistency through a voxel‑based loss. A lightweight static codebook further bridges continuous features and discrete identities without requiring data pre‑processing or complex optimization. Quantitative and qualitative experiments demonstrate the effectiveness of InstDrive, and to the best of our knowledge, it is the first framework to achieve 3D instance segmentation in dynamic, open‑world driving scenes. More visualizations are available at our project page.

Abstract:
We train representation models with procedural data only, and apply them on visual similarity, classification, and semantic segmentation tasks without further training by using visual memory ‑‑ an explicit database of reference image embeddings. Unlike prior work on visual memory, our approach achieves full compartmentalization with respect to all real‑world images while retaining strong performance. Compared to a model trained on Places, our procedural model performs within 1% on NIGHTS visual similarity, outperforms by 8% and 15% on CUB200 and Flowers102 fine‑grained classification, and is within 10% on ImageNet‑1K classification. It also demonstrates strong zero‑shot segmentation, achieving an R^2 on COCO within 10% of the models trained on real data. Finally, we analyze procedural versus real data models, showing that parts of the same object have dissimilar representations in procedural models, resulting in incorrect searches in memory and explaining the remaining performance gap.
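
The idea of a visual memory, an explicit database of reference embeddings queried at inference time, can be sketched as a nearest-neighbour lookup. The sketch below is a generic cosine-similarity k-NN classifier; the function name `knn_predict`, the choice of cosine similarity, and majority voting are assumptions for illustration, not details given in the abstract.

```python
from collections import Counter

import numpy as np


def knn_predict(query: np.ndarray, memory: np.ndarray, labels: list, k: int = 3):
    """Classify one query embedding by majority vote over its k nearest
    reference embeddings, using cosine similarity (assumed metric)."""
    q = query / np.linalg.norm(query)
    m = memory / np.linalg.norm(memory, axis=1, keepdims=True)
    sims = m @ q                      # cosine similarity to every memory entry
    top = np.argsort(-sims)[:k]       # indices of the k most similar entries
    votes = Counter(labels[i] for i in top)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy memory: embeddings would come from the (procedurally trained) encoder.
    memory = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
    labels = ["cat", "cat", "dog"]
    print(knn_predict(np.array([1.0, 0.05]), memory, labels, k=3))
```

Since the reference images live only in the memory and never touch the encoder's weights, entries can be added or removed without retraining, which is what makes compartmentalization possible.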

Abstract:
Deploying multiple machine learning models on resource‑constrained robotic platforms for different perception tasks often results in redundant computations, large memory footprints, and complex integration challenges. In response, this work presents Visual Perception Engine (VPEngine), a modular framework designed to enable efficient GPU usage for visual multitasking while maintaining extensibility and developer accessibility. Our framework architecture leverages a shared foundation model backbone that extracts image representations, which are efficiently shared, without any unnecessary GPU‑CPU memory transfers, across multiple specialized task‑specific model heads running in parallel. This design eliminates the computational redundancy inherent in the feature extraction component when deploying traditional sequential models while enabling dynamic task prioritization based on application demands. We demonstrate our framework's capabilities through an example implementation using DINOv2 as the foundation model with multiple task heads (depth, object detection and semantic segmentation), achieving up to 3x speedup compared to sequential execution. Building on CUDA Multi‑Process Service (MPS), VPEngine offers efficient GPU utilization and maintains a constant memory footprint while allowing per‑task inference frequencies to be adjusted dynamically during runtime. The framework is written in Python and is open source with ROS2 C++ (Humble) bindings for ease of use by the robotics community across diverse robotic platforms. Our example implementation demonstrates end‑to‑end real‑time performance at ≥50 Hz on NVIDIA Jetson Orin AGX for TensorRT optimized models.

Abstract:
Deep neural networks have become the go‑to method for biomedical instance segmentation. Generalist models like Cellpose demonstrate state‑of‑the‑art performance across diverse cellular data, though their effectiveness often degrades on domains that differ from their training data. While supervised fine‑tuning can address this limitation, it requires annotated data that may not be readily available. We propose SelfAdapt, a method that enables the adaptation of pre‑trained cell segmentation models without the need for labels. Our approach builds upon student‑teacher augmentation consistency training, introducing L2‑SP regularization and label‑free stopping criteria. We evaluate our method on the LiveCell and TissueNet datasets, demonstrating relative improvements in AP0.5 of up to 29.64% over baseline Cellpose. Additionally, we show that our unsupervised adaptation can further improve models that were previously fine‑tuned with supervision. We release SelfAdapt as an easy‑to‑use extension of the Cellpose framework. The code for our method is publicly available at https://github.com/Kainmueller‑Lab/self_adapt.
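
L2‑SP regularization penalizes the distance between the adapted weights and their pre‑trained starting point, rather than the distance to zero as in plain weight decay. A minimal sketch of that penalty term follows; the function name `l2_sp_penalty`, the dictionary layout, and the value of `alpha` are hypothetical, and how SelfAdapt combines the term with its consistency loss is not specified in the abstract.

```python
import numpy as np


def l2_sp_penalty(params: dict, pretrained: dict, alpha: float = 0.01) -> float:
    """L2-SP term: (alpha / 2) * sum_i ||w_i - w_i^0||^2, where w_i^0 are the
    pre-trained weights of every parameter shared with the starting model."""
    total = 0.0
    for name, w in params.items():
        w0 = pretrained[name]               # starting point for this tensor
        total += float(np.sum((w - w0) ** 2))
    return 0.5 * alpha * total


if __name__ == "__main__":
    pretrained = {"conv.weight": np.zeros(2)}
    adapted = {"conv.weight": np.array([1.0, 2.0])}
    # The penalty grows as the adapted weights drift from the pre-trained ones.
    print(l2_sp_penalty(adapted, pretrained, alpha=2.0))
```

The penalty is zero when the weights have not moved, so it acts as an anchor that limits drift during label‑free adaptation.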

Abstract:
Pedestrian segmentation in automotive perception systems faces critical safety challenges due to metamerism in RGB imaging, where pedestrians and backgrounds appear visually indistinguishable. This study investigates the potential of hyperspectral imaging (HSI) for enhanced pedestrian segmentation in urban driving scenarios using the Hyperspectral City v2 (H‑City) dataset. We compared standard RGB against two dimensionality‑reduction approaches by converting 128‑channel HSI data into three‑channel representations: Principal Component Analysis (PCA) and optimal band selection using Contrast Signal‑to‑Noise Ratio with Joint Mutual Information Maximization (CSNR‑JMIM). Three semantic segmentation models were evaluated: U‑Net, DeepLabV3+, and SegFormer. CSNR‑JMIM consistently outperformed RGB with average improvements of 1.44% in Intersection over Union (IoU) and 2.18% in F1‑score for pedestrian segmentation. Rider segmentation showed similar gains with 1.43% IoU and 2.25% F1‑score improvements. This improved performance results from the enhanced spectral discrimination of optimally selected HSI bands, which effectively reduces false positives. This study demonstrates robust pedestrian segmentation through optimal HSI band selection, showing significant potential for safety‑critical automotive applications.
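
The PCA baseline, reducing a 128‑channel cube to three channels, can be sketched with a plain eigendecomposition of the band covariance matrix. This is a generic PCA illustration under the assumption of an (H, W, C) array layout; the function name `pca_reduce` is hypothetical, and the CSNR‑JMIM band selection is not reproduced here.

```python
import numpy as np


def pca_reduce(cube: np.ndarray, n_components: int = 3) -> np.ndarray:
    """Project an (H, W, C) hyperspectral cube onto its top principal components."""
    h, w, c = cube.shape
    flat = cube.reshape(-1, c).astype(np.float64)   # one spectrum per pixel
    flat -= flat.mean(axis=0)                       # centre each spectral band
    cov = flat.T @ flat / (flat.shape[0] - 1)       # (C, C) band covariance
    _, eigvecs = np.linalg.eigh(cov)                # eigenvalues in ascending order
    top = eigvecs[:, ::-1][:, :n_components]        # largest-variance directions
    return (flat @ top).reshape(h, w, n_components)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cube = rng.normal(size=(8, 8, 128))             # synthetic stand-in for HSI data
    print(pca_reduce(cube).shape)
```

Unlike band selection, each PCA output channel mixes all 128 bands, so the resulting channels no longer correspond to physical wavelengths.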

Abstract:
Domain generalization in 3D segmentation is a critical challenge in deploying models to unseen environments. Current methods mitigate the domain shift by augmenting the data distribution of point clouds. However, the model learns global geometric patterns in point clouds while ignoring the category‑level distribution and alignment. In this paper, a category‑level geometry learning framework is proposed to explore the domain‑invariant geometric features for domain generalized 3D semantic segmentation. Specifically, Category‑level Geometry Embedding (CGE) is proposed to perceive the fine‑grained geometric properties of point cloud features, which constructs the geometric properties of each class and couples geometric embedding to semantic learning. Secondly, Geometric Consistent Learning (GCL) is proposed to simulate the latent 3D distribution and align the category‑level geometric embeddings, allowing the model to focus on the geometric invariant information to improve generalization. Experimental results verify the effectiveness of the proposed method, which has very competitive segmentation accuracy compared with the state‑of‑the‑art domain generalized point cloud methods.

Abstract:
Human‑robot collaboration requires robots to quickly infer user intent, provide transparent reasoning, and assist users in achieving their goals. Our recent work introduced GUIDER, our framework for inferring navigation and manipulation intents. We propose augmenting GUIDER with a vision‑language model (VLM) and a text‑only language model (LLM) to form a semantic prior that filters objects and locations based on the mission prompt. A vision pipeline (YOLO for object detection and the Segment Anything Model for instance segmentation) feeds candidate object crops into the VLM, which scores their relevance given an operator prompt; in addition, the list of detected object labels is ranked by a text‑only LLM. These scores weight the existing navigation and manipulation layers of GUIDER, selecting context‑relevant targets while suppressing unrelated objects. Once the combined belief exceeds a threshold, autonomy changes occur, enabling the robot to navigate to the desired area and retrieve the desired object, while adapting to any changes in the operator's intent. Future work will evaluate the system on Isaac Sim using a Franka Emika arm on a Ridgeback base, with a focus on real‑time assistance.

Abstract:
A detailed understanding of material structure across multiple scales is essential for predictive modeling of textile‑reinforced composites. Nesting ‑‑ characterized by the interlocking of adjacent fabric layers through local interpenetration and misalignment of yarns ‑‑ plays a critical role in defining mechanical properties such as stiffness, permeability, and damage tolerance. This study presents a framework to quantify nesting behavior in dry textile reinforcements under compaction using low‑resolution computed tomography (CT). In‑situ compaction experiments were conducted on various stacking configurations, with CT scans acquired at 20.22 μm per voxel resolution. A tailored 3D‑UNet enabled semantic segmentation of matrix, weft, and fill phases across compaction stages corresponding to fiber volume contents of 50‑‑60 %. The model achieved a minimum mean Intersection‑over‑Union of 0.822 and an F1 score of 0.902. Spatial structure was subsequently analyzed using the two‑point correlation function S_2, allowing for probabilistic extraction of average layer thickness and nesting degree. The results show strong agreement with micrograph‑based validation. This methodology provides a robust approach for extracting key geometrical features from industrially relevant CT data and establishes a foundation for reverse modeling and descriptor‑based structural analysis of composite preforms.
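
For a binary phase indicator I(x), the two‑point correlation function S_2(r) is the probability that two points separated by r both lie in the phase. A minimal sketch of its computation on a segmented volume is given below, assuming periodic boundaries so that an FFT autocorrelation can be used; the periodicity assumption and the function name `two_point_correlation` are illustrative and not taken from the abstract.

```python
import numpy as np


def two_point_correlation(indicator: np.ndarray) -> np.ndarray:
    """Periodic S_2 of a binary phase indicator, via FFT autocorrelation.

    Entry [r] is the fraction of positions x with I(x) = I(x + r) = 1,
    so the zero-shift entry equals the phase volume fraction.
    """
    ind = indicator.astype(np.float64)
    spectrum = np.fft.fftn(ind)
    autocorr = np.fft.ifftn(spectrum * np.conj(spectrum)).real
    return autocorr / ind.size


if __name__ == "__main__":
    # 1D toy "layer stack": alternating phase / non-phase voxels.
    layers = np.array([1, 0, 1, 0])
    print(two_point_correlation(layers))
```

In a layered structure, the shift at which S_2 peaks again along the stacking direction reflects the layer period, which is the kind of feature the average layer thickness can be read from.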

Abstract:
Referring‑based Video Object Segmentation is a multimodal problem that requires producing fine‑grained segmentation results guided by external cues. Traditional approaches to this task typically involve training specialized models, which come with high computational complexity and manual annotation effort. Recent advances in vision‑language foundation models open a promising direction toward training‑free approaches. Several studies have explored leveraging these general‑purpose models for fine‑grained segmentation, achieving performance comparable to that of fully supervised, task‑specific models. However, existing methods rely on fixed pipelines that lack the flexibility needed to adapt to the dynamic nature of the task. To address this limitation, we propose Multi‑Modal Agent, a novel agentic system designed to solve this task in a more flexible and adaptive manner. Specifically, our method leverages the reasoning capabilities of large language models (LLMs) to generate dynamic workflows tailored to each input. This adaptive procedure iteratively interacts with a set of specialized tools designed for low‑level tasks across different modalities to identify the target object described by the multimodal cues. Our agentic approach demonstrates clear improvements over prior methods on two multimodal‑conditioned VOS tasks: RVOS and Ref‑AVS.

Abstract:
Post‑Training Quantization (PTQ) and Quantization‑Aware Training (QAT) represent two mainstream model quantization approaches. However, PTQ often leads to unacceptable performance degradation in quantized models, while QAT imposes substantial GPU memory requirements and extended training time due to weight fine‑tuning. In this paper, we propose PTQAT, a novel general hybrid quantization algorithm for the efficient deployment of 3D perception networks. To address the speed‑accuracy trade‑off between PTQ and QAT, our method selects critical layers for QAT fine‑tuning and performs PTQ on the remaining layers. Contrary to intuition, fine‑tuning the layers with smaller output discrepancies before and after quantization, rather than those with larger discrepancies, actually leads to greater improvements in the model's quantization accuracy. This means we better compensate for quantization errors during their propagation, rather than addressing them at the point where they occur. The proposed PTQAT achieves similar performance to QAT with more efficiency by freezing nearly 50% of quantifiable layers. Additionally, PTQAT is a universal quantization method that supports various quantization bit widths (4 bits) as well as different model architectures, including CNNs and Transformers. The experimental results on nuScenes across diverse 3D perception tasks, including object detection, semantic segmentation, and occupancy prediction, show that our method consistently outperforms QAT‑only baselines. Notably, it achieves 0.2%‑0.9% NDS and 0.3%‑1.0% mAP gains in object detection, 0.3%‑2.0% mIoU gains in semantic segmentation and occupancy prediction while fine‑tuning fewer weights.
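
The layer‑selection rule described above, fine‑tuning the layers whose outputs change least under quantization and leaving the rest to PTQ, can be sketched as a simple ranking. The function name `split_layers`, the per‑layer discrepancy values, and the exact 50% split are hypothetical; the abstract does not state which discrepancy metric PTQAT uses.

```python
from typing import Dict, List, Tuple


def split_layers(discrepancy: Dict[str, float],
                 qat_fraction: float = 0.5) -> Tuple[List[str], List[str]]:
    """Rank layers by output discrepancy (full precision vs. quantized) and
    assign the SMALLEST-discrepancy fraction to QAT; the rest stay PTQ-only."""
    ranked = sorted(discrepancy, key=discrepancy.get)        # ascending order
    n_qat = max(1, round(len(ranked) * qat_fraction))
    return ranked[:n_qat], ranked[n_qat:]


if __name__ == "__main__":
    # Hypothetical per-layer discrepancies, e.g. mean squared output error.
    scores = {"backbone.0": 0.3, "backbone.1": 0.1, "neck": 0.9, "head": 0.2}
    qat_layers, ptq_layers = split_layers(scores)
    print("fine-tune with QAT:", qat_layers)
    print("freeze after PTQ:  ", ptq_layers)
```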

Abstract:
While previous studies on image segmentation focus on handling severe (or explicit) label noise, real‑world datasets also exhibit subtle (or implicit) label imperfections. These arise from inherent challenges, such as ambiguous object boundaries and annotator variability. Although not explicitly present, such mild and latent noise can still impair model performance. Typical data augmentation methods, which apply identical transformations to the image and its label, risk amplifying these subtle imperfections and limiting the model's generalization capacity. In this paper, we introduce NSegment+, a novel augmentation framework that decouples image and label transformations to address such realistic noise for semantic segmentation. By introducing controlled elastic deformations only to segmentation labels while preserving the original images, our method encourages models to focus on learning robust representations of object structures despite minor label inconsistencies. Extensive experiments demonstrate that NSegment+ consistently improves performance, achieving mIoU gains of up to +2.29, +2.38, +1.75, and +3.39 on average on Vaihingen, LoveDA, Cityscapes, and PASCAL VOC, respectively, even without bells and whistles, highlighting the importance of addressing implicit label noise. These gains can be further amplified when combined with other training tricks, including CutMix and Label Smoothing.
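
The core idea, deforming only the label map while the image stays untouched, can be sketched with a smoothed random displacement field and nearest‑neighbour resampling. This is a generic elastic‑deformation sketch, not the NSegment+ implementation: the function names, the FFT‑based smoothing, and the `alpha` and `sigma` values are all assumptions.

```python
import numpy as np


def smooth_noise(shape, sigma, rng):
    """Random field smoothed by a Gaussian low-pass filter (periodic, via FFT),
    rescaled so that its largest absolute value is 1."""
    noise = rng.uniform(-1.0, 1.0, size=shape)
    fy = np.fft.fftfreq(shape[0])[:, None]
    fx = np.fft.fftfreq(shape[1])[None, :]
    kernel = np.exp(-2.0 * (np.pi * sigma) ** 2 * (fx ** 2 + fy ** 2))
    field = np.fft.ifft2(np.fft.fft2(noise) * kernel).real
    return field / (np.abs(field).max() + 1e-12)


def deform_label(label, alpha=3.0, sigma=4.0, rng=None):
    """Elastically deform a 2D integer label map; alpha is the maximum
    displacement in pixels, sigma the smoothness of the displacement field."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = label.shape
    dy = alpha * smooth_noise((h, w), sigma, rng)
    dx = alpha * smooth_noise((h, w), sigma, rng)
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_y = np.clip(np.rint(yy + dy), 0, h - 1).astype(int)
    src_x = np.clip(np.rint(xx + dx), 0, w - 1).astype(int)
    return label[src_y, src_x]   # nearest-neighbour lookup keeps class ids discrete


if __name__ == "__main__":
    label = np.zeros((16, 16), dtype=int)
    label[4:12, 4:12] = 1
    warped = deform_label(label, rng=np.random.default_rng(0))
    # The image paired with this label would be passed to the model unchanged.
    print((warped != label).sum(), "label pixels changed")
```

Nearest‑neighbour resampling matters here: interpolating class ids would create values that correspond to no class.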

Abstract:
This paper presents our methodology and findings from three tasks across Optical Character Recognition (OCR) and Document Layout Analysis using advanced deep learning techniques. First, for the historical Hebrew fragments of the Dead Sea Scrolls, we enhanced our dataset through extensive data augmentation and employed the Kraken and TrOCR models to improve character recognition. For the task on 16th‑ to 18th‑century meeting resolutions, we utilized a Convolutional Recurrent Neural Network (CRNN) that integrated DeepLabV3+ for semantic segmentation with a Bidirectional LSTM, incorporating confidence‑based pseudolabeling to refine our model. Finally, for the modern English handwriting recognition task, we applied a CRNN with a ResNet34 encoder, trained using the Connectionist Temporal Classification (CTC) loss function to effectively capture sequential dependencies. This report offers valuable insights and suggests potential directions for future research.

Abstract:
Multimodal learning has gained much success in recent years. However, current multimodal fusion methods adopt the attention mechanism of Transformers to implicitly learn the underlying correlation of multimodal features. As a result, the multimodal model cannot capture the essential features of each modality, making it difficult to comprehend complex structures and correlations of multimodal inputs. This paper introduces a novel Multimodal Attention‑based Normalizing Flow (MANGO) approach to developing explicit, interpretable, and tractable multimodal fusion learning. In particular, we propose a new Invertible Cross‑Attention (ICA) layer to develop the Normalizing Flow‑based Model for multimodal data. To efficiently capture the complex, underlying correlations in multimodal data in our proposed invertible cross‑attention layer, we propose three new cross‑attention mechanisms: Modality‑to‑Modality Cross‑Attention (MMCA), Inter‑Modality Cross‑Attention (IMCA), and Learnable Inter‑Modality Cross‑Attention (LICA). Finally, we introduce a new Multimodal Attention‑based Normalizing Flow to enable the scalability of our proposed method to high‑dimensional multimodal data. Our experimental results on three different multimodal learning tasks, i.e., semantic segmentation, image‑to‑image translation, and movie genre classification, have illustrated the state‑of‑the‑art (SoTA) performance of the proposed approach.

Abstract:
Large language models (LLMs) in natural language processing (NLP) have demonstrated great potential for in‑context learning (ICL) ‑‑ the ability to leverage a few sets of example prompts to adapt to various tasks without having to explicitly update the model weights. ICL has recently been explored for computer vision tasks with promising early outcomes. These approaches involve specialized training and/or additional data that complicate the process and limit its generalizability. In this work, we show that off‑the‑shelf Stable Diffusion models can be repurposed for visual in‑context learning (V‑ICL). Specifically, we formulate an in‑place attention re‑computation within the self‑attention layers of the Stable Diffusion architecture that explicitly incorporates context between the query and example prompts. Without any additional fine‑tuning, we show that this repurposed Stable Diffusion model is able to adapt to six different tasks: foreground segmentation, single object detection, semantic segmentation, keypoint detection, edge detection, and colorization. For example, the proposed approach improves the mean intersection over union (mIoU) for the foreground segmentation task on the Pascal‑5i dataset by 8.9% and 3.2% over recent methods such as Visual Prompting and IMProv, respectively. Additionally, we show that the proposed method is able to effectively leverage multiple prompts through ensembling to infer the task better and further improve the performance.

Abstract:
A prevailing trend in neural network research suggests that model performance improves with increasing depth and capacity ‑ often at the cost of integrability and efficiency. In this paper, we propose a strategy to optimize small, deployable models by enhancing their capabilities through structured knowledge inheritance. We introduce Hereditary Knowledge Transfer (HKT), a biologically inspired framework for modular and selective transfer of task‑relevant features from a larger, pretrained parent network to a smaller child model. Unlike standard knowledge distillation, which enforces uniform imitation of teacher outputs, HKT draws inspiration from biological inheritance mechanisms ‑ such as memory RNA transfer in planarians ‑ to guide a multi‑stage process of feature transfer. Neural network blocks are treated as functional carriers, and knowledge is transmitted through three biologically motivated components: Extraction, Transfer, and Mixture (ETM). A novel Genetic Attention (GA) mechanism governs the integration of inherited and native representations, ensuring both alignment and selectivity. We evaluate HKT across diverse vision tasks, including optical flow (Sintel, KITTI), image classification (CIFAR‑10), and semantic segmentation (LiTS), demonstrating that it significantly improves child model performance while preserving its compactness. The results show that HKT consistently outperforms conventional distillation approaches, offering a general‑purpose, interpretable, and scalable solution for deploying high‑performance neural networks in resource‑constrained environments.

Abstract:
In the task of 3D Aerial‑view Scene Semantic Segmentation (3D‑AVS‑SS), traditional methods struggle to address semantic ambiguity caused by scale variations and structural occlusions in aerial images. This limits their segmentation accuracy and consistency. To tackle these challenges, we propose a novel 3D‑AVS‑SS approach named SAD‑Splat. Our method introduces a Gaussian point drop module, which integrates semantic confidence estimation with a learnable sparsity mechanism based on the Hard Concrete distribution. This module effectively eliminates redundant and semantically ambiguous Gaussian points, enhancing both segmentation performance and representation compactness. Furthermore, SAD‑Splat incorporates a high‑confidence pseudo‑label generation pipeline. It leverages 2D foundation models to enhance supervision when ground‑truth labels are limited, thereby further improving segmentation accuracy. To advance research in this domain, we introduce a challenging benchmark dataset: 3D Aerial Semantic (3D‑AS), which encompasses diverse real‑world aerial scenes with sparse annotations. Experimental results demonstrate that SAD‑Splat achieves an excellent balance between segmentation accuracy and representation compactness. It offers an efficient and scalable solution for 3D aerial scene understanding.
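
The learnable sparsity mechanism mentioned above builds on the Hard Concrete distribution. The following is a minimal sketch of a Hard Concrete gate in plain Python, not SAD‑Splat's implementation: the stretch limits, the temperature, the function names, and the rule "drop a point when its deterministic gate is exactly zero" are all assumptions made for illustration.

```python
import math
import random

# Stretch limits and temperature commonly used with the Hard Concrete
# distribution; assumed here, not taken from the SAD-Splat abstract.
GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0


def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def hard_concrete_gate(log_alpha, rng=None):
    """Return a gate in [0, 1] for one point.

    With rng given, a stochastic training-time sample is drawn;
    without it, the deterministic test-time gate is returned.
    """
    if rng is None:
        s = _sigmoid(log_alpha)
    else:
        u = min(max(rng.random(), 1e-6), 1.0 - 1e-6)  # keep log() finite
        s = _sigmoid((math.log(u) - math.log(1.0 - u) + log_alpha) / BETA)
    stretched = s * (ZETA - GAMMA) + GAMMA  # stretch to (GAMMA, ZETA)
    return min(1.0, max(0.0, stretched))    # hard clip to [0, 1]


def keep_indices(log_alphas):
    """Hypothetical pruning rule: keep points whose deterministic gate is > 0."""
    return [i for i, la in enumerate(log_alphas) if hard_concrete_gate(la) > 0.0]
```

Because the stretched sigmoid is clipped, a gate can reach exactly 0 or 1, which is what allows points to be removed outright rather than merely down‑weighted.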

Abstract:
Semantic segmentation of city‑scale point clouds is a critical technology for Unmanned Aerial Vehicle (UAV) perception systems, enabling the classification of 3D points without relying on any visual information to achieve comprehensive 3D understanding. However, existing models are frequently constrained by the limited scale of 3D data and the domain gap between datasets, which lead to reduced generalization capability. To address these challenges, we propose CitySeg, a foundation model for city‑scale point cloud semantic segmentation that incorporates text modality to achieve open vocabulary segmentation and zero‑shot inference. Specifically, in order to mitigate the issue of non‑uniform data distribution across multiple domains, we customize the data preprocessing rules, and propose a local‑global cross‑attention network to enhance the perception capabilities of point networks in UAV scenarios. To resolve semantic label discrepancies across datasets, we introduce a hierarchical classification strategy. A hierarchical graph established according to the data annotation rules consolidates the data labels, and the graph encoder is used to model the hierarchical relationships between categories. In addition, we propose a two‑stage training strategy and employ hinge loss to increase the feature separability of subcategories. Experimental results demonstrate that the proposed CitySeg achieves state‑of‑the‑art (SOTA) performance on nine closed‑set benchmarks, significantly outperforming existing approaches. Moreover, for the first time, CitySeg enables zero‑shot generalization in city‑scale point cloud scenarios without relying on visual information.

Abstract:
3D Gaussian Splatting has emerged as a powerful paradigm for explicit 3D scene representation, yet achieving efficient and consistent 3D segmentation remains challenging. Existing segmentation approaches typically rely on high‑dimensional feature lifting, which causes costly optimization, implicit semantics, and task‑specific constraints. We present Segment Any Gaussians Online (SAGOnline), a unified, zero‑shot framework that achieves real‑time, cross‑view consistent segmentation without scene‑specific training. SAGOnline decouples the monolithic segmentation problem into lightweight sub‑tasks. By integrating video foundation models (e.g., SAM 2), we first generate temporally consistent 2D masks across rendered views. Crucially, instead of learning continuous feature fields, we introduce a Rasterization‑aware Geometric Consensus mechanism that leverages the traceability of the Gaussian rasterization pipeline. This allows us to deterministically map 2D predictions to explicit, discrete 3D primitive labels in real‑time. This discrete representation eliminates the memory and computational burden of feature distillation, enabling instant inference. Extensive evaluations on NVOS and SPIn‑NeRF benchmarks demonstrate that SAGOnline achieves state‑of‑the‑art accuracy (92.7% and 95.2% mIoU) while operating at the fastest speed at 27 ms per frame. By providing a flexible interface for diverse foundation models, our framework supports instant prompt, instance, and semantic segmentation, paving the way for interactive 3D understanding in AR/VR and robotics.
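
One plausible reading of mapping 2D predictions to discrete 3D primitive labels is a weighted vote: each labeled pixel votes for the Gaussians that contributed to it during rasterization. The sketch below illustrates that idea only; the data layout (`pixel_contribs`, `mask_labels`) and the function name are hypothetical, and the abstract does not specify SAGOnline's actual consensus rule.

```python
from collections import defaultdict


def vote_gaussian_labels(pixel_contribs, mask_labels):
    """Assign each Gaussian the label with the largest accumulated weight.

    pixel_contribs: {pixel: [(gaussian_id, blend_weight), ...]} (assumed layout)
    mask_labels:    {pixel: label} from a 2D segmentation mask (assumed layout)
    """
    votes = defaultdict(lambda: defaultdict(float))
    for pixel, contribs in pixel_contribs.items():
        label = mask_labels.get(pixel)
        if label is None:
            continue  # pixel not covered by any mask
        for gaussian_id, weight in contribs:
            votes[gaussian_id][label] += weight
    # One discrete label per Gaussian: the arg-max of its weighted votes.
    return {gid: max(lv, key=lv.get) for gid, lv in votes.items()}
```

Storing a single discrete label per primitive, rather than a feature vector, is what keeps the memory cost of such a scheme low.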

Abstract:
Multi‑object tracking (MOT) in monocular videos is fundamentally challenged by occlusions and depth ambiguity, issues that conventional tracking‑by‑detection (TBD) methods struggle to resolve owing to a lack of geometric awareness. To address these limitations, we introduce GRASPTrack, a novel depth‑aware MOT framework that integrates monocular depth estimation and instance segmentation into a standard TBD pipeline to generate high‑fidelity 3D point clouds from 2D detections, thereby enabling explicit 3D geometric reasoning. These 3D point clouds are then voxelized to enable a precise and robust Voxel‑Based 3D Intersection‑over‑Union (IoU) for spatial association. To further enhance tracking robustness, our approach incorporates Depth‑aware Adaptive Noise Compensation, which dynamically adjusts the Kalman filter process noise based on occlusion severity for more reliable state estimation. Additionally, we propose a Depth‑enhanced Observation‑Centric Momentum, which extends the motion direction consistency from the image plane into 3D space to improve motion‑based association cues, particularly for objects with complex trajectories. Extensive experiments on the MOT17, MOT20, and DanceTrack benchmarks demonstrate that our method achieves competitive performance, significantly improving tracking robustness in complex scenes with frequent occlusions and intricate motion patterns.
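
A voxel‑based 3D IoU of the kind described above can be sketched as follows: both point clouds are quantized onto the same grid, and the IoU is computed over the sets of occupied voxels. This is an illustration under assumptions, not GRASPTrack's code; the function names and the default voxel size are hypothetical.

```python
import math


def voxelize(points, voxel_size):
    """Map 3D points to the set of integer voxel indices they occupy."""
    return {
        (
            math.floor(x / voxel_size),
            math.floor(y / voxel_size),
            math.floor(z / voxel_size),
        )
        for x, y, z in points
    }


def voxel_iou(points_a, points_b, voxel_size=0.5):
    """IoU between the occupied-voxel sets of two point clouds."""
    voxels_a = voxelize(points_a, voxel_size)
    voxels_b = voxelize(points_b, voxel_size)
    union = len(voxels_a | voxels_b)
    return len(voxels_a & voxels_b) / union if union else 0.0
```

Unlike a 2D box IoU, two objects that overlap in the image but sit at different depths share few or no voxels, so their score stays low.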

Abstract:
The goal of multimodal image fusion is to integrate complementary information from infrared and visible images, generating multimodal fused images for downstream tasks. Existing downstream pre‑training models are typically trained on visible images. However, the significant pixel distribution differences between visible and multimodal fusion images can degrade downstream task performance, sometimes even below that of using only visible images. This paper explores adapting multimodal fused images with significant modality differences to object detection and semantic segmentation models trained on visible images. To address this, we propose MambaTrans, a novel multimodal fusion image modality translator. MambaTrans uses descriptions from a multimodal large language model and masks from semantic segmentation models as input. Its core component, the Multi‑Model State Space Block, combines mask‑image‑text cross‑attention and a 3D‑Selective Scan Module, enhancing pure visual capabilities. By leveraging object detection prior knowledge, MambaTrans minimizes detection loss during training and captures long‑term dependencies among text, masks, and images. This enables favorable results in pre‑trained models without adjusting their parameters. Experiments on public datasets show that MambaTrans effectively improves multimodal image performance in downstream tasks.

Abstract:
Large vision models like the Segment Anything Model (SAM) exhibit significant limitations when applied to downstream tasks in the wild. Consequently, reference segmentation, which leverages reference images and their corresponding masks to impart novel knowledge to the model, emerges as a promising new direction for adapting vision models. However, existing reference segmentation approaches predominantly rely on meta‑learning, which still necessitates an extensive meta‑training process and brings massive data and computational cost. In this study, we propose a novel approach by representing the inherent correspondence between reference‑target image pairs as a pseudo video. This perspective allows the latest version of SAM, known as SAM2, which is equipped with interactive video object segmentation (iVOS) capabilities, to be adapted to downstream tasks in a lightweight manner. We term this approach Correspondence As Video for SAM (CAV‑SAM). CAV‑SAM comprises two key modules: the Diffusion‑Based Semantic Transition (DBST) module employs a diffusion model to construct a semantic transformation sequence, while the Test‑Time Geometric Alignment (TTGA) module aligns the geometric changes within this sequence through test‑time fine‑tuning. We evaluated CAV‑SAM on widely‑used datasets, achieving segmentation performance improvements exceeding 5% over SOTA methods. Implementation is provided in the supplementary materials.

Abstract:
Vision‑language alignment in video must address the complexity of language, evolving interacting entities, their action chains, and semantic gaps between language and vision. This work introduces Planner‑Refiner, a framework to overcome these challenges. Planner‑Refiner bridges the semantic gap by iteratively refining visual elements' space‑time representation, guided by language until semantic gaps are minimal. A Planner module schedules language guidance by decomposing complex linguistic prompts into short sentence chains. The Refiner processes each short sentence, a noun‑phrase and verb‑phrase pair, to direct visual tokens' self‑attention across space then time, achieving efficient single‑step refinement. A recurrent system chains these steps, maintaining refined visual token representations. The final representation feeds into task‑specific heads for alignment generation. We demonstrate Planner‑Refiner's effectiveness on two video‑language alignment tasks: Referring Video Object Segmentation and Temporal Grounding with varying language complexity. We further introduce a new MeViS‑X benchmark to assess models' capability with long queries. Superior performance versus state‑of‑the‑art methods on these benchmarks shows the approach's potential, especially for complex prompts.

Abstract:
Training‑free Camouflaged Object Segmentation (COS) seeks to segment camouflaged objects without task‑specific training, by automatically generating visual prompts to guide the Segment Anything Model (SAM). However, existing pipelines mostly yield semantic‑level prompts, which drive SAM to coarse semantic masks and struggle to handle multiple discrete camouflaged instances effectively. To address this critical limitation, we propose an Instance‑Aware Prompting Framework (IAPF), the first training‑free COS pipeline to upgrade prompt granularity from the semantic to the instance level while keeping all components frozen. The centerpiece is an Instance Mask Generator that (i) leverages a detector‑agnostic enumerator to produce precise instance‑level box prompts for the foreground tag, and (ii) introduces the Single‑Foreground Multi‑Background Prompting (SFMBP) strategy to sample region‑constrained point prompts within each box prompt, enabling SAM to output instance masks. The pipeline is supported by a simple text prompt generator that produces image‑specific tags and a self‑consistency vote across synonymous task‑generic prompts to stabilize inference. Extensive evaluations on three COS benchmarks, two CIS benchmarks, and two downstream datasets demonstrate state‑of‑the‑art performance among training‑free methods. Code will be released upon acceptance.

Abstract:
Safety‑critical control using high‑dimensional sensory feedback from optical data (e.g., images, point clouds) poses significant challenges in domains like autonomous driving and robotic surgery. Control can rely on low‑dimensional states estimated from high‑dimensional data. However, the estimation errors often follow complex, unknown distributions that standard probabilistic models fail to capture, making formal safety guarantees challenging. In this work, we introduce a novel characterization of these general estimation errors using sub‑Gaussian noise with bounded mean. We develop a new technique for uncertainty propagation of the proposed noise characterization in linear systems, which combines robust set‑based methods with the propagation of sub‑Gaussian variance proxies. We further develop a Model Predictive Control (MPC) framework that provides closed‑loop safety guarantees for linear systems under the proposed noise assumption. We apply this MPC approach in an ultrasound‑image‑guided robotic spinal surgery pipeline, which contains deep‑learning‑based semantic segmentation, image‑based registration, high‑level optimization‑based planning, and low‑level robotic control. To validate the pipeline, we developed a realistic simulation environment integrating real human anatomy, robot dynamics, efficient ultrasound simulation, as well as in‑vivo data of breathing motion and drilling force. Evaluation results in simulation demonstrate the potential of our approach for solving complex image‑guided robotic surgery tasks while ensuring safety.

Abstract:
Training deep neural networks (DNNs) has become an increasingly resource‑intensive task, requiring large volumes of labeled data, substantial computational power, and considerable fine‑tuning efforts to achieve optimal performance across diverse use cases. Although pre‑trained models offer a useful starting point, adapting them to meet specific user needs often demands extensive customization and infrastructure overhead. This challenge grows when a single model must support diverse applications with differing requirements for performance. Traditional solutions often involve training multiple model versions to meet varying requirements, which can be inefficient and difficult to maintain. To overcome this challenge, we propose NNObfuscator, a novel utility control mechanism that enables AI models to dynamically modify their performance according to predefined conditions. Unlike traditional methods that need separate models for each user, NNObfuscator allows a single model to be adapted in real time, giving users controlled access to multiple levels of performance. This mechanism enables model owners to set up tiered access, ensuring that free‑tier users receive a baseline level of performance while premium users benefit from enhanced capabilities. The approach improves resource allocation, reduces unnecessary computation, and supports sustainable business models in AI deployment. To validate our approach, we conducted experiments on multiple tasks, including image classification, semantic segmentation, and text‑to‑image generation, using well‑established models such as ResNet, DeepLab, VGG16, FCN and Stable Diffusion. Experimental results show that NNObfuscator successfully makes a model more adaptable, so that a single trained model can handle a broad range of tasks without requiring extensive changes.

Abstract:
Accurate semantic segmentation of urban remote sensing images (URSIs) is essential for urban planning and environmental monitoring. However, it remains challenging due to the subtle texture differences and similar spatial structures among geospatial objects, which cause semantic ambiguity and misclassification. Additional complexities arise from irregular object shapes, blurred boundaries, and overlapping spatial distributions of objects, resulting in diverse and intricate edge morphologies. To address these issues, we propose TEFormer, a texture‑aware and edge‑guided Transformer. Our model features a texture‑aware module (TaM) in the encoder to capture fine‑grained texture distinctions between visually similar categories, thereby enhancing semantic discrimination. The decoder incorporates an edge‑guided tri‑branch decoder (Eg3Head) to preserve local edges and details while maintaining multiscale context‑awareness. Finally, an edge‑guided feature fusion module (EgFFM) effectively integrates contextual, detail, and edge information to achieve refined semantic segmentation. Extensive evaluation demonstrates that TEFormer yields mIoU scores of 88.57% on Potsdam and 81.46% on Vaihingen, exceeding the next best methods by 0.73% and 0.22%. On the LoveDA dataset, it secures the second position with an overall mIoU of 53.55%, trailing the optimal performance by a narrow margin of 0.19%.

Abstract:
Pleural effusion semantic segmentation can significantly enhance the accuracy and timeliness of clinical diagnosis and treatment by precisely identifying disease severity and lesion areas. Currently, semantic segmentation of pleural effusion CT images faces multiple challenges. These include similar gray levels between effusion and surrounding tissues, blurred edges, and variable morphology. Existing methods often struggle with diverse image variations and complex edges, primarily because direct feature concatenation causes semantic gaps. To address these challenges, we propose the Dual‑Branch Interactive Fusion Attention model (DBIF‑AUNet). This model constructs a densely nested skip‑connection network and innovatively refines the Dual‑Domain Feature Disentanglement module (DDFD). The DDFD module orthogonally decouples the functions of dual‑domain modules to achieve multi‑scale feature complementarity and enhance characteristics at different levels. Concurrently, we design a Branch Interaction Attention Fusion module (BIAF) that works synergistically with the DDFD. This module dynamically weights and fuses global, local, and frequency band features, thereby improving segmentation robustness. Furthermore, we implement a nested deep supervision mechanism with hierarchical adaptive hybrid loss to effectively address class imbalance. Through validation on 1,622 pleural effusion CT images from Southwest Hospital, DBIF‑AUNet achieved IoU and Dice scores of 80.1% and 89.0% respectively. These results outperform state‑of‑the‑art medical image segmentation models U‑Net++ and Swin‑UNet by 5.7%/2.7% and 2.2%/1.5% respectively, demonstrating significant optimization in segmentation accuracy for complex pleural effusion CT images.
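
As a quick consistency check on the two reported metrics: for a single prediction and ground‑truth mask pair, Dice and IoU are tied by Dice = 2·IoU / (1 + IoU). The sketch below applies that identity; it is only approximate for scores averaged over a dataset, and how the abstract's figures were averaged is not stated.

```python
def dice_from_iou(iou):
    """Dice coefficient implied by an IoU value (exact for one mask pair)."""
    return 2.0 * iou / (1.0 + iou)


def iou_from_dice(dice):
    """Inverse mapping: IoU implied by a Dice coefficient."""
    return dice / (2.0 - dice)


# An IoU of 0.801 implies a Dice of about 0.8895,
# in line with the reported 89.0% Dice.
implied_dice = dice_from_iou(0.801)
```

Since the mapping is monotonic, the two metrics always rank a pair of single‑image predictions the same way; they differ in how strongly they penalize partial overlap.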

Abstract:
Semantic segmentation in open‑vocabulary scenarios presents significant challenges due to the wide range and granularity of semantic categories. Existing weakly‑supervised methods often rely on category‑specific supervision and ill‑suited feature construction methods for contrastive learning, leading to semantic misalignment and poor performance. In this work, we propose a novel weakly‑supervised approach, SynSeg, to address these challenges. SynSeg performs Multi‑Category Contrastive Learning (MCCL) as a stronger training signal with a new feature reconstruction framework named Feature Synergy Structure (FSS). Specifically, the MCCL strategy robustly combines both intra‑ and inter‑category alignment and separation so that the model learns the correlations among different categories within the same image. Moreover, FSS reconstructs discriminative features for contrastive learning through prior fusion and semantic‑activation‑map enhancement, effectively avoiding the foreground bias introduced by the visual encoder. Furthermore, SynSeg is a lightweight end‑to‑end solution that does not use any intermediate output from large‑scale pretrained models and is capable of real‑time inference. Overall, SynSeg efficiently improves semantic localization and discrimination under weak supervision. Extensive experiments on benchmarks demonstrate that our method outperforms state‑of‑the‑art (SOTA) methods. In particular, SynSeg achieves higher accuracy than SOTA baselines by margins ranging from 6.9% to 26.2%.

Abstract:
Autonomous driving systems face significant challenges in perceiving complex environments and making real‑time decisions. Traditional modular approaches, while offering interpretability, suffer from error propagation and coordination issues, whereas end‑to‑end learning systems can simplify the design but face computational bottlenecks. This paper presents a novel approach to autonomous driving using deep reinforcement learning (DRL) that integrates bird's‑eye view (BEV) perception for enhanced real‑time decision‑making. We introduce the Mamba‑BEV model, an efficient spatio‑temporal feature extraction network that combines BEV‑based perception with the Mamba framework for temporal feature modeling. This integration allows the system to encode vehicle surroundings and road features in a unified coordinate system and accurately model long‑range dependencies. Building on this, we propose the ME³‑BEV framework, which utilizes the Mamba‑BEV model as a feature input for end‑to‑end DRL, achieving superior performance in dynamic urban driving scenarios. We further enhance the interpretability of the model by visualizing high‑dimensional features through semantic segmentation, providing insight into the learned representations. Extensive experiments on the CARLA simulator demonstrate that ME³‑BEV outperforms existing models across multiple metrics, including collision rate and trajectory accuracy, offering a promising solution for real‑time autonomous driving.

Abstract:
Existing methods for human parsing into body parts and clothing often use fixed mask categories with broad labels that obscure fine‑grained clothing types. Recent open‑vocabulary segmentation approaches leverage pretrained text‑to‑image (T2I) diffusion model features for strong zero‑shot transfer, but typically group entire humans into a single person category, failing to distinguish diverse clothing or detailed body parts. To address this, we propose Spectrum, a unified network for part‑level pixel parsing (body parts and clothing) and instance‑level grouping. While diffusion‑based open‑vocabulary models generalize well across tasks, their internal representations are not specialized for detailed human parsing. We observe that, unlike diffusion models with broad representations, image‑driven 3D texture generators maintain faithful correspondence to input images, enabling stronger representations for parsing diverse clothing and body parts. Spectrum introduces a novel repurposing of an Image‑to‑Texture (I2Tx) diffusion model (obtained by fine‑tuning a T2I model on 3D human texture maps) for improved alignment with body parts and clothing. From an input image, we extract human‑part internal features via the I2Tx diffusion model and generate semantically valid masks aligned to diverse clothing categories through prompt‑guided grounding. Once trained, Spectrum produces semantic segmentation maps for every visible body part and clothing category, ignoring standalone garments or irrelevant objects, for any number of humans in the scene. We conduct extensive cross‑dataset experiments, separately assessing body parts, clothing parts, unseen clothing categories, and full‑body masks, and demonstrate that Spectrum consistently outperforms baseline methods in prompt‑based segmentation.

Abstract:
Vision Transformers have substantially advanced the capabilities of segmentation models across both image and video domains. Among them, the Swin Transformer stands out for its ability to capture hierarchical, multi‑scale representations, making it a popular backbone for segmentation in videos. However, despite its window‑attention scheme, it still incurs a high computational cost, especially in larger variants commonly used for dense prediction in videos. This remains a major bottleneck for real‑time, resource‑constrained applications. Whilst token reduction methods have been proposed to alleviate this, the window‑based attention mechanism of Swin requires a fixed number of tokens per window, limiting the applicability of conventional pruning techniques. Meanwhile, training‑free token clustering approaches have shown promise in image segmentation while maintaining window consistency. Nevertheless, they fail to exploit temporal redundancy, missing a key opportunity to further optimize video segmentation performance. We introduce Temporal Cluster Assignment (TCA), a lightweight and effective, fine‑tuning‑free strategy that enhances token clustering by leveraging temporal coherence across frames. Instead of indiscriminately dropping redundant tokens, TCA refines token clusters using temporal correlations, thereby retaining fine‑grained details while significantly reducing computation. Extensive evaluations on YouTube‑VIS 2019, YouTube‑VIS 2021, OVIS, and a private surgical video dataset show that TCA consistently boosts the accuracy‑speed trade‑off of existing clustering‑based methods. Our results demonstrate that TCA generalizes competently across both natural and domain‑specific videos.

Abstract:
In recent years, large‑scale visual backbones have demonstrated remarkable capabilities in learning general‑purpose features from images via extensive pre‑training. Concurrently, many efficient architectures have emerged that have performance comparable to that of larger models on in‑domain benchmarks. However, we observe that for smaller models, the performance drop on out‑of‑distribution (OOD) data is disproportionately larger, indicating a deficiency in the generalization performance of existing efficient models. To address this, we identify key architectural bottlenecks and inappropriate design choices that contribute to this issue, retaining robustness for smaller models. To restore the global field of pure window attention, we further introduce a Coordinator‑patch Cross Attention (CoCA) mechanism, featuring dynamic, domain‑aware global tokens that enhance local‑global feature modeling and adaptively capture robust patterns across domains with minimal computational overhead. Integrating these advancements, we present CoCAViT, a novel visual backbone designed for robust real‑time visual representation. Extensive experiments empirically validate our design. At a resolution of 224×224, CoCAViT‑28M achieves 84.0% top‑1 accuracy on ImageNet‑1K, with significant gains on multiple OOD benchmarks, compared to competing models. It also attains 52.2 mAP on COCO object detection and 51.3 mIoU on ADE20K semantic segmentation, while maintaining low latency.

Abstract:
Open‑world point cloud semantic segmentation (OW‑Seg) aims to predict point labels of both base and novel classes in real‑world scenarios. However, existing methods rely on resource‑intensive offline incremental learning or densely annotated support data, limiting their practicality. To address these limitations, we propose HOW‑Seg, the first human‑in‑the‑loop framework for OW‑Seg. Specifically, we construct class prototypes, the fundamental segmentation units, directly on the query data, avoiding the prototype bias caused by intra‑class distribution shifts between the support and query data. By leveraging sparse human annotations as guidance, HOW‑Seg enables prototype‑based segmentation for both base and novel classes. Considering the lack of granularity of initial prototypes, we introduce a hierarchical prototype disambiguation mechanism to refine ambiguous prototypes, which correspond to annotations of different classes. To further enrich contextual awareness, we employ a dense conditional random field (CRF) upon the refined prototypes to optimize their label assignments. Through iterative human feedback, HOW‑Seg dynamically improves its predictions, achieving high‑quality segmentation for both base and novel classes. Experiments demonstrate that with sparse annotations (e.g., one‑novel‑class‑one‑click), HOW‑Seg matches or surpasses the state‑of‑the‑art generalized few‑shot segmentation (GFS‑Seg) method under the 5‑shot setting. When using advanced backbones (e.g., Stratified Transformer) and denser annotations (e.g., 10 clicks per sub‑scene), HOW‑Seg achieves 85.27% mIoU on S3DIS and 66.37% mIoU on ScanNetv2, significantly outperforming alternatives.
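
The basic idea of prototype‑based segmentation from sparse clicks can be sketched as a nearest‑prototype classifier: clicked points define one mean feature vector per class, and every point takes the label of its closest prototype. This is a generic illustration, not HOW‑Seg itself; the hierarchical disambiguation and CRF steps are omitted, and all names and the data layout are hypothetical.

```python
def build_prototypes(features, clicks):
    """Average the features of clicked points per class.

    features: list of per-point feature vectors (assumed layout)
    clicks:   {point_index: class_label} sparse human annotations
    """
    sums, counts = {}, {}
    for idx, label in clicks.items():
        feat = features[idx]
        if label not in sums:
            sums[label] = [0.0] * len(feat)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], feat)]
        counts[label] += 1
    return {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}


def assign_labels(features, prototypes):
    """Label every point with the class of its nearest prototype."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(prototypes, key=lambda lab: sq_dist(feat, prototypes[lab]))
        for feat in features
    ]
```

A new click only changes the prototype of its own class, so predictions can be refreshed after each round of feedback without retraining a network.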

Abstract:
Atrial fibrillation (AF) represents the most prevalent type of cardiac arrhythmia for which treatment may require patients to undergo ablation therapy. In this procedure, cardiac tissue is locally scarred on purpose to prevent electrical signals from causing arrhythmia. Patient‑specific cardiac digital twin models show great potential for personalized ablation therapy; however, they demand accurate semantic segmentation of healthy and scarred tissue, typically obtained from late gadolinium enhanced (LGE) magnetic resonance (MR) scans. In this work, we propose the Left Atrial Cascading Refinement CNN (LA‑CaRe‑CNN), which aims to accurately segment the left atrium as well as left atrial scar tissue from LGE MR scans. LA‑CaRe‑CNN is a 2‑stage CNN cascade that is trained end‑to‑end in 3D, where Stage 1 generates a prediction for the left atrium, which is then refined in Stage 2 in conjunction with the original image information to obtain a prediction for the left atrial scar tissue. To account for domain shift towards domains unknown during training, we employ strong intensity and spatial augmentation to increase the diversity of the training dataset. Our proposed method based on a 5‑fold ensemble achieves great segmentation results, namely, 89.21% DSC and 1.6969 mm ASSD for the left atrium, as well as 64.59% DSC and 91.80% G‑DSC for the more challenging left atrial scar tissue. Thus, segmentations obtained through LA‑CaRe‑CNN show great potential for the generation of patient‑specific cardiac digital twin models and downstream tasks like personalized targeted ablation therapy to treat AF.

Abstract:
Continual Semantic Segmentation (CSS) seeks to incrementally learn to segment novel classes while preserving knowledge of previously encountered ones. Recent advancements in CSS have been largely driven by the adoption of Pre‑trained Vision Models (PVMs) as backbones. Among existing strategies, Direct Fine‑Tuning (DFT), which sequentially fine‑tunes the model across classes, remains the most straightforward approach. Prior work often regards DFT as a performance lower bound due to its presumed vulnerability to severe catastrophic forgetting, leading to the development of numerous complex mitigation techniques. However, we contend that this prevailing assumption is flawed. In this paper, we systematically revisit forgetting in DFT across two standard benchmarks, Pascal VOC 2012 and ADE20K, under eight CSS settings using two representative PVM backbones: ResNet101 and Swin‑B. Through a detailed probing analysis, our findings reveal that existing methods significantly underestimate the inherent anti‑forgetting capabilities of PVMs. Even under DFT, PVMs retain previously learned knowledge with minimal forgetting. Further investigation of the feature space indicates that the observed forgetting primarily arises from the classifier's drift away from the PVM, rather than from degradation of the backbone representations. Based on this insight, we propose DFT*, a simple yet effective enhancement to DFT that incorporates strategies such as freezing the PVM backbone and previously learned classifiers, as well as pre‑allocating future classifiers. Extensive experiments show that DFT* consistently achieves competitive or superior performance compared to sixteen state‑of‑the‑art CSS methods, while requiring substantially fewer trainable parameters and less training time.

Abstract:
This paper presents OC‑DiT, a novel class of diffusion models designed for object‑centric prediction, and applies it to zero‑shot instance segmentation. We propose a conditional latent diffusion framework that generates instance masks by conditioning the generative process on object templates and image features within the diffusion model's latent space. This allows our model to effectively disentangle object instances through the diffusion process, which is guided by visual object descriptors and localized image cues. Specifically, we introduce two model variants: a coarse model for generating initial object instance proposals, and a refinement model that refines all proposals in parallel. We train these models on a newly created, large‑scale synthetic dataset comprising thousands of high‑quality object meshes. Remarkably, our model achieves state‑of‑the‑art performance on multiple challenging real‑world benchmarks, without requiring any retraining on target data. Through comprehensive ablation studies, we demonstrate the potential of diffusion models for instance segmentation tasks.

Abstract:
In remote sensing, most segmentation networks adopt the UNet architecture, often incorporating modules such as Transformers or Mamba to enhance global‑local feature interactions within decoder stages. However, these enhancements typically focus on intra‑scale relationships and neglect the global contextual dependencies across multiple resolutions. To address this limitation, we introduce the Terrace Convolutional Decoder Network (TNet), a simple yet effective architecture that leverages only convolution and addition operations to progressively integrate low‑resolution features (rich in global context) into higher‑resolution features (rich in local details) across decoding stages. This progressive fusion enables the model to learn spatially‑aware convolutional kernels that naturally blend global and local information in a stage‑wise manner. We implement TNet with a ResNet‑18 encoder (TNet‑R) and evaluate it on three benchmark datasets. TNet‑R achieves competitive performance with a mean Intersection‑over‑Union (mIoU) of 85.35% on ISPRS Vaihingen, 87.05% on ISPRS Potsdam, and 52.19% on LoveDA, while maintaining high computational efficiency. Code is publicly available.
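
The mIoU figures reported above follow the standard definition: per-class intersection over union, averaged over classes. A minimal sketch, assuming a confusion matrix indexed as `confusion[ground_truth][prediction]`; the function name and the toy matrix are illustrative and not taken from the paper:

```python
def mean_iou(confusion):
    """Mean Intersection-over-Union from a square confusion matrix.

    confusion[i][j] counts pixels whose ground-truth class is i and
    whose predicted class is j (an assumed layout for this sketch).
    """
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                       # class-c pixels that were missed
        fp = sum(confusion[r][c] for r in range(n)) - tp  # pixels wrongly labelled c
        denom = tp + fp + fn
        if denom > 0:  # skip classes absent from both prediction and ground truth
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


# Toy two-class example (made-up counts).
toy_confusion = [[8, 2],
                 [1, 9]]
toy_miou = mean_iou(toy_confusion)
```

Classes that appear in neither the prediction nor the ground truth are skipped here; benchmarks differ on this convention, so reported numbers may use a different rule.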

Abstract:
Accurately estimating urban rail platform occupancy can enhance transit agencies' ability to make informed operational decisions, thereby improving safety, operational efficiency, and customer experience, particularly in the context of crowding. However, sensing real‑time crowding remains challenging and often depends on indirect proxies such as automatic fare collection data or staff observations. Recently, Closed‑Circuit Television (CCTV) footage has emerged as a promising data source with the potential to yield accurate, real‑time occupancy estimates. The presented study investigates this potential by comparing three state‑of‑the‑art computer vision approaches for extracting crowd‑related features from platform CCTV imagery: (a) object detection and counting using YOLOv11, RT‑DETRv2, and APGCC; (b) crowd‑level classification via a custom‑trained Vision Transformer, Crowd‑ViT; and (c) semantic segmentation using DeepLabV3. Additionally, we present a novel, highly efficient linear‑optimization‑based approach to extract counts from the generated segmentation maps while accounting for image object depth and, thus, for passenger dispersion along a platform. Tested on a privacy‑preserving dataset created in collaboration with the Washington Metropolitan Area Transit Authority (WMATA) that encompasses more than 600 hours of video material, our results demonstrate that computer vision approaches can provide substantive value for crowd estimation. This work demonstrates that CCTV image data, independent of other data sources available to a transit agency, can enable more precise real‑time crowding estimation and, eventually, timely operational responses for platform crowding mitigation.

Abstract:
Realistic and controllable panoramic LiDAR data generation is critical for scalable 3D perception in autonomous driving and robotics. Existing methods either perform unconditional generation with poor controllability or adopt text‑guided synthesis, which lacks fine‑grained spatial control. Leveraging a monocular RGB image as a spatial control signal offers a scalable and low‑cost alternative, but this remains an open problem. It faces three core challenges: (i) semantic and depth cues from RGB vary spatially, complicating reliable conditional generation; (ii) modality gaps between RGB appearance and LiDAR geometry amplify alignment errors under noisy diffusion; and (iii) maintaining structural coherence between monocular RGB and panoramic LiDAR is challenging, particularly in non‑overlapping regions between images and LiDAR. To address these challenges, we propose Veila, a novel conditional diffusion framework that integrates: a Confidence‑Aware Conditioning Mechanism (CACM) that strengthens RGB conditioning by adaptively balancing semantic and depth cues according to their local reliability; a Geometric Cross‑Modal Alignment (GCMA) for robust RGB‑LiDAR alignment under noisy diffusion; and a Panoramic Feature Coherence (PFC) for enforcing global structural consistency across monocular RGB and panoramic LiDAR. Additionally, we introduce two metrics, Cross‑Modal Semantic Consistency and Cross‑Modal Depth Consistency, to evaluate alignment quality across modalities. Experiments on nuScenes, SemanticKITTI, and our proposed KITTI‑Weather benchmark demonstrate that Veila achieves state‑of‑the‑art generation fidelity and cross‑modal consistency, while enabling generative data augmentation that improves downstream LiDAR semantic segmentation.

Abstract:
Unsupervised video segmentation is a challenging computer vision task, especially due to the lack of supervisory signals coupled with the complexity of visual scenes. To overcome this challenge, state‑of‑the‑art models based on slot attention often have to rely on large and computationally expensive neural architectures. To this end, we propose a simple knowledge distillation framework that effectively transfers object‑centric representations to a lightweight student. The proposed framework, called SlotMatch, aligns corresponding teacher and student slots via cosine similarity, requiring no additional distillation objectives or auxiliary supervision. The simplicity of SlotMatch is confirmed via theoretical and empirical evidence, both indicating that integrating additional losses is redundant. We conduct experiments on three datasets to compare the state‑of‑the‑art teacher model, SlotContrast, with our distilled student. The results show that our student based on SlotMatch matches and even outperforms its teacher, while using 3.6x fewer parameters and running up to 2.7x faster. Moreover, our student surpasses all other state‑of‑the‑art unsupervised video segmentation models.
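
The slot alignment described above can be illustrated with a small cosine-similarity objective. This is a hedged sketch, not the authors' implementation: it assumes teacher and student slots are already in index-wise correspondence and share a dimension, and it uses the common form "one minus mean cosine similarity". The names `cosine_similarity` and `slot_alignment_loss` are hypothetical.

```python
import math


def cosine_similarity(u, v, eps=1e-12):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)  # eps guards against zero vectors


def slot_alignment_loss(teacher_slots, student_slots):
    """1 - mean cosine similarity over index-matched slot pairs.

    Assumption: slot k of the student corresponds to slot k of the teacher.
    """
    sims = [cosine_similarity(t, s) for t, s in zip(teacher_slots, student_slots)]
    return 1.0 - sum(sims) / len(sims)


# Toy slots (made-up values): the first pair is parallel, the second orthogonal.
teacher = [[1.0, 0.0], [0.0, 1.0]]
student = [[2.0, 0.0], [1.0, 0.0]]
loss = slot_alignment_loss(teacher, student)
```

Because cosine similarity ignores vector magnitude, the student is only asked to match the direction of each teacher slot, which is one reason such an objective can work across models of different capacity.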

Abstract:
Accurate and efficient characterization of nanoparticle morphology in Scanning Electron Microscopy (SEM) images is critical for ensuring product quality in nanomaterial synthesis and accelerating development. However, conventional deep learning methods for shape classification require extensive labeled datasets and computationally demanding training, limiting their accessibility to the typical nanoparticle practitioner in research and industrial settings. In this study, we introduce a zero‑shot classification pipeline that leverages two vision foundation models: the Segment Anything Model (SAM) for object segmentation and DINOv2 for feature embedding. By combining these models with a lightweight classifier, we achieve high‑precision shape classification across three morphologically diverse nanoparticle datasets ‑ without the need for extensive parameter fine‑tuning. Our methodology outperforms a fine‑tuned YOLOv11 and ChatGPT o4‑mini‑high baselines, demonstrating robustness to small datasets, subtle morphological variations, and domain shifts from natural to scientific imaging. Quantitative clustering metrics on PCA plots of the DINOv2 features are discussed as a means of assessing the progress of the chemical synthesis. This work highlights the potential of foundation models to advance automated microscopy image analysis, offering an alternative to traditional deep learning pipelines in nanoparticle research which is both more efficient and more accessible to the user.

Abstract:
Modality‑agnostic Semantic Segmentation (MaSS) aims to achieve robust scene understanding across arbitrary combinations of input modalities. Existing methods typically rely on explicit feature alignment to achieve modal homogenization, which dilutes the distinctive strengths of each modality and destroys their inherent complementarity. To achieve cooperative harmonization rather than homogenization, we propose CHARM, a novel complementary learning framework designed to implicitly align content while preserving modality‑specific advantages through two components: (1) a Mutual Perception Unit (MPU), enabling implicit alignment through window‑based cross‑modal interaction, where modalities serve as both queries and contexts for each other to discover modality‑interactive correspondences; (2) a dual‑path optimization strategy that decouples training into a Collaborative Learning Strategy (CoL) for complementary fusion learning and an Individual Enhancement Strategy (InE) for protected modality‑specific optimization. Experiments across multiple datasets and backbones indicate that CHARM consistently outperforms the baselines, with significant gains on the fragile modalities. This work shifts the focus from model homogenization to harmonization, enabling cross‑modal complementarity for true harmony in diversity.

Abstract:
Domain Generalized Semantic Segmentation (DGSS) aims to improve the generalization ability of models across unseen domains without access to target data during training. Recent advances in DGSS have increasingly exploited vision foundation models (VFMs) via parameter‑efficient fine‑tuning strategies. However, most existing approaches concentrate on global feature fine‑tuning, while overlooking hierarchical adaptation across feature levels, which is crucial for precise dense prediction. In this paper, we propose Multi‑Granularity Feature Calibration (MGFC), a novel framework that performs coarse‑to‑fine alignment of VFM features to enhance robustness under domain shifts. Specifically, MGFC first calibrates coarse‑grained features to capture global contextual semantics and scene‑level structure. Then, it refines medium‑grained features by promoting category‑level feature discriminability. Finally, fine‑grained features are calibrated through high‑frequency spatial detail enhancement. By performing hierarchical and granularity‑aware calibration, MGFC effectively transfers the generalization strengths of VFMs to the domain‑specific task of DGSS. Extensive experiments on benchmark datasets demonstrate that our method outperforms state‑of‑the‑art DGSS approaches, highlighting the effectiveness of multi‑granularity adaptation for the semantic segmentation task of domain generalization.

Abstract:
Foundation models have shown strong performance in multi‑object segmentation with visual prompts, yet histopathology images remain challenging due to high cellular density, heterogeneity, and the gap between pixel‑level supervision and clinical segmentation intent (e.g., selectively segmenting nuclei of a specific type). In practice, such intents are expressed through diverse and noisy prompts, causing prompt‑intent misalignment and inconsistent predictions. We introduce SAMPO (Segmentation Anything Model with Preference Optimization), a preference‑aligned fine‑tuning framework that explicitly aligns pathology foundation models with clinical segmentation intent. SAMPO is the first to adapt Direct Preference Optimization (DPO) to pure vision foundation models, enabling accurate segmentation from minimal and imperfect prompts. The framework features three key components: (1) online prompt‑centric preference mining to synthesize preference pairs across prompt qualities; (2) multi‑mask preference learning to leverage output ambiguity for fine‑grained ranking supervision; and (3) a hybrid loss combining preference optimization with pixel‑level supervision for stable training. Trained on two datasets covering four tasks and evaluated on corresponding test sets and 12 external validation datasets, SAMPO consistently improves segmentation accuracy, robustness to prompt variations, and clinical intent adherence in dense histopathology images.

Abstract:
Over the past decade, object detection has advanced significantly, with the YOLO (You Only Look Once) family of models transforming the landscape of real‑time vision applications through unified, end‑to‑end detection frameworks. From YOLOv1's pioneering regression‑based detection to the latest YOLOv9, each version has systematically enhanced the balance between speed, accuracy, and deployment efficiency through continuous architectural and algorithmic advancements. Beyond core object detection, modern YOLO architectures have expanded to support tasks such as instance segmentation, pose estimation, object tracking, and domain‑specific applications including medical imaging and industrial automation. This paper offers a comprehensive review of the YOLO family, highlighting architectural innovations, performance benchmarks, extended capabilities, and real‑world use cases. We critically analyze the evolution of YOLO models and discuss emerging research directions that extend their impact across diverse computer vision domains.

Abstract:
Activity and behaviour correlate with dairy cow health and welfare, making continual and accurate monitoring crucial for disease identification and farm productivity. Manual observation and frequent assessments are laborious and inconsistent for activity monitoring. In this study, we developed a unique multi‑camera, real‑time tracking system for indoor‑housed Holstein Friesian dairy cows. This technology uses cutting‑edge computer vision techniques, including instance segmentation and tracking algorithms, to monitor cow activity seamlessly and accurately. An integrated top‑down barn panorama was created by geometrically aligning six camera feeds using homographic transformations. The detection phase used a refined YOLO11‑m model trained on an overhead cow dataset, obtaining high accuracy (mAP@0.50 = 0.97, F1 = 0.95). SAMURAI, an upgraded Segment Anything Model 2.1, generated pixel‑precise cow masks for instance segmentation utilizing zero‑shot learning and motion‑aware memory. Even with occlusion and fluctuating posture, a motion‑aware Linear Kalman filter and IoU‑based data association reliably identified cows over time for object tracking. The proposed system significantly outperformed Deep SORT Realtime. Multi‑Object Tracking Accuracy (MOTA) was 98.7% and 99.3% in two benchmark video sequences, with IDF1 scores above 99% and near‑zero identity switches. Our data indicate that this unified multi‑camera system can track dairy cows in complex interior surroundings in real time. The system reduces redundant detections across overlapping cameras and maintains continuity as cows move between viewpoints, with the aim of improving early sickness prediction through activity quantification and behavioural classification.
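
To make the IoU‑based data association step concrete, here is a minimal sketch that greedily matches predicted track boxes to new detections by descending IoU. It is an illustration under assumptions, not the system's actual code: boxes are taken to be `(x1, y1, x2, y2)` tuples, the greedy strategy and the `iou_threshold` default are hypothetical choices, and the Kalman prediction that would produce the track boxes is omitted.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def associate(track_boxes, detection_boxes, iou_threshold=0.3):
    """Greedy matching: best-IoU pairs first, each track and detection used once.

    Returns a dict mapping track index -> detection index.
    """
    candidates = sorted(
        ((box_iou(t, d), ti, di)
         for ti, t in enumerate(track_boxes)
         for di, d in enumerate(detection_boxes)),
        reverse=True,
    )
    matches, used_detections = {}, set()
    for iou, ti, di in candidates:
        if iou < iou_threshold:
            break  # remaining pairs overlap too little to be the same object
        if ti in matches or di in used_detections:
            continue
        matches[ti] = di
        used_detections.add(di)
    return matches


# Toy example (made-up boxes): detections arrive in a different order than tracks.
tracks = [(0, 0, 10, 10), (20, 20, 30, 30)]
detections = [(21, 21, 31, 31), (1, 1, 11, 11)]
assignment = associate(tracks, detections)
```

Unmatched tracks and detections are simply left out of the result; a full tracker would use them to age out lost tracks and to start new ones.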

Abstract:
Robot‑assisted surgeries rely on accurate and real‑time scene understanding to safely guide surgical instruments. However, segmentation models trained on static datasets face key limitations when deployed in these dynamic and evolving surgical environments. Class‑incremental semantic segmentation (CISS) allows models to continually adapt to new classes while avoiding catastrophic forgetting of prior knowledge, without training on previous data. In this work, we build upon the recently introduced Taxonomy‑Oriented Poincaré‑regularized Incremental Class Segmentation (TOPICS) approach and propose an enhanced variant, termed TOPICS+, specifically tailored for robust segmentation of surgical scenes. Concretely, we incorporate the Dice loss into the hierarchical loss formulation to handle strong class imbalances, introduce hierarchical pseudo‑labeling, and design tailored label taxonomies for robotic surgery environments. We also propose six novel CISS benchmarks designed for robotic surgery environments including multiple incremental steps and several semantic categories to emulate realistic class‑incremental settings in surgical environments. In addition, we introduce a refined set of labels with more than 144 classes on the Syn‑Mediverse synthetic dataset, hosted online as an evaluation benchmark. We make the code and trained models publicly available at http://topics.cs.uni‑freiburg.de.

Abstract:
Various adverse weather conditions such as fog and rain pose a significant challenge to autonomous driving (AD) perception tasks like semantic segmentation and object detection. The common domain adaptation strategy is to minimize the disparity between images captured in clear and adverse weather conditions. However, domain adaptation faces two challenges: (I) it typically relies on a clear image as a reference, which is challenging to obtain in practice; (II) it generally targets a single adverse weather condition and performs poorly when confronted with a mixture of multiple adverse weather conditions. To address these issues, we introduce a reference‑free and Adverse weather condition‑independent (Advent) framework (rather than a specific model architecture) that can be implemented with various backbones and heads. This is achieved by leveraging the homogeneity over short durations, which removes the need for a clear reference and generalizes to arbitrary weather conditions. Specifically, Advent includes three integral components: (I) the Locally Sequential Mechanism (LSM) leverages temporal correlations between adjacent frames to achieve a weather‑condition‑agnostic effect, thanks to the homogeneity underlying arbitrary weather conditions; (II) the Globally Shuffled Mechanism (GSM) shuffles segments processed by LSM from different positions of the input sequence to prevent overfitting to LSM‑induced temporal patterns; (III) Unfolded Regularizers (URs) are the deep unfolding implementation of two proposed regularizers that penalize model complexity to enhance cross‑weather generalization. We take the semantic segmentation task as an example to assess the proposed Advent framework. Extensive experiments demonstrate that the proposed Advent outperforms existing state‑of‑the‑art baselines by large margins.

Abstract:
Infrared thermography is emerging as a powerful tool in sports medicine, allowing assessment of thermal radiation during exercise and analysis of anatomical regions of interest, such as the well‑exposed calves. Building on our previous advanced automatic annotation method, we aimed to transfer the stereo‑ and multimodal‑based labeling approach from treadmill running to ergometer cycling. Therefore, the training of the semantic segmentation network with automatic labels and fine‑tuning on high‑quality manually annotated images has been examined and compared in different data set combinations. The results indicate that fine‑tuning with a small fraction of manual data is sufficient to improve the overall performance of the deep neural network. Finally, combining automatically generated labels with small manually annotated data sets accelerates the adaptation of deep neural networks to new use cases, such as the transition from treadmill to bicycle.

Abstract:
Connected autonomous vehicles (CAVs) must simultaneously perform multiple tasks, such as object detection, semantic segmentation, depth estimation, trajectory prediction, motion prediction, and behaviour prediction, to ensure safe and reliable navigation in complex environments. Vehicle‑to‑everything (V2X) communication enables cooperative driving among CAVs, thereby mitigating the limitations of individual sensors, reducing occlusions, and improving perception over long distances. Traditionally, these tasks are addressed using distinct models, which leads to high deployment costs, increased computational overhead, and challenges in achieving real‑time performance. Multi‑task learning (MTL) has recently emerged as a promising solution that enables the joint learning of multiple tasks within a single unified model. This offers improved efficiency and resource utilization. To the best of our knowledge, this survey is the first comprehensive review focused on MTL in the context of CAVs. We begin with an overview of CAVs and MTL to provide foundational background. We then explore the application of MTL across key functional modules, including perception, prediction, planning, control, and multi‑agent collaboration. Finally, we discuss the strengths and limitations of existing methods, identify key research gaps, and provide directions for future research aimed at advancing MTL methodologies for CAV systems.

Abstract:
This study analyzes semantic segmentation performance across heterogeneously labeled point‑cloud datasets relevant to public safety applications, including pre‑incident planning systems derived from lidar scans. Using NIST's Point Cloud City dataset (Enfield and Memphis collections), we investigate challenges in unifying differently labeled 3D data. Our methodology employs a graded schema with the KPConv architecture, evaluating performance through IoU metrics on safety‑relevant features. Results indicate performance variability: geometrically large objects (e.g. stairs, windows) achieve higher segmentation performance, suggesting potential for navigational context, while smaller safety‑critical features exhibit lower recognition rates. Performance is impacted by class imbalance and the limited geometric distinction of smaller objects in typical lidar scans, indicating limitations in detecting certain safety‑relevant features using current point‑cloud methods. Key identified challenges include insufficient labeled data, difficulties in unifying class labels across datasets, and the need for standardization. Potential directions include automated labeling and multi‑dataset learning strategies. We conclude that reliable point‑cloud semantic segmentation for public safety necessitates standardized annotation protocols and improved labeling techniques to address data heterogeneity and the detection of small, safety‑critical elements.

Abstract:
Robot navigation in unstructured environments requires multimodal perception systems that can support safe navigation. Multimodality enables the integration of complementary information collected by different sensors. However, this information must be processed by machine learning algorithms specifically designed to leverage heterogeneous data. Furthermore, it is necessary to identify which sensor modalities are most informative for navigation in the target environment. In Martian exploration, thermal imagery has proven valuable for assessing terrain safety due to differences in thermal behaviour between soil types. This work presents OmniUnet, a transformer‑based neural network architecture for semantic segmentation using RGB, depth, and thermal (RGB‑D‑T) imagery. A custom multimodal sensor housing was developed using 3D printing and mounted on the Martian Rover Testbed for Autonomy (MaRTA) to collect a multimodal dataset in the Bardenas semi‑desert in northern Spain. This location serves as a representative environment of the Martian surface, featuring terrain types such as sand, bedrock, and compact soil. A subset of this dataset was manually labeled to support supervised training of the network. The model was evaluated both quantitatively and qualitatively, achieving a pixel accuracy of 80.37% and demonstrating strong performance in segmenting complex unstructured terrain. Inference tests yielded an average prediction time of 673 ms on a resource‑constrained computer (Jetson Orin Nano), confirming its suitability for on‑robot deployment. The software implementation of the network and the labeled dataset have been made publicly available to support future research in multimodal terrain perception for planetary robotics.

Abstract:
Fine‑tuning pre‑trained vision‑language models has emerged as a powerful approach for enhancing open‑vocabulary semantic segmentation (OVSS). However, the substantial computational and resource demands associated with training on large datasets have prompted interest in training‑free methods for OVSS. Existing training‑free approaches primarily focus on modifying model architectures and generating prototypes to improve segmentation performance. However, they often neglect the challenges posed by class redundancy, where multiple categories are not present in the current test image, and visual‑language ambiguity, where semantic similarities among categories create confusion in class activation. These issues can lead to suboptimal class activation maps and affinity‑refined activation maps. Motivated by these observations, we propose FreeCP, a novel training‑free class purification framework designed to address these challenges. FreeCP focuses on purifying semantic categories and rectifying errors caused by redundancy and ambiguity. The purified class representations are then leveraged to produce final segmentation predictions. We conduct extensive experiments across eight benchmarks to validate FreeCP's effectiveness. Results demonstrate that FreeCP, as a plug‑and‑play module, significantly boosts segmentation performance when combined with other OVSS methods.

Abstract:
We introduce a U‑Net model for 360° acoustic source localization formulated as a spherical semantic segmentation task. Rather than regressing discrete direction‑of‑arrival (DoA) angles, our model segments beamformed audio maps (azimuth & elevation) into regions of active sound presence. Using delay‑and‑sum (DAS) beamforming on a custom 24‑microphone array, we generate signals aligned with drone GPS telemetry to create binary supervision masks. A modified U‑Net, trained on frequency‑domain representations of these maps, learns to identify spatially distributed source regions while addressing class imbalance via the Tversky loss. Because the network operates on beamformed energy maps, the approach is inherently array‑independent and can be transferred to different microphone configurations with minimal adaptation. The segmentation outputs are post‑processed by computing centroids over activated regions, enabling robust DoA estimates. Our dataset includes real‑world open‑field recordings of a DJI Air 3 drone, synchronized with 360° video and flight logs across multiple dates and locations. Experimental results show that the U‑Net generalizes across environments, providing improved angular precision and offering a new paradigm for dense spatial audio understanding beyond traditional Sound Source Localization (SSL). We additionally validate the same beamforming‑plus‑segmentation formulation on the DCASE 2019 TAU Spatial Sound Events benchmark, showing that the approach generalizes beyond drone acoustics to multiclass Sound Event Localization and Detection (SELD) scenarios.
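
The Tversky loss mentioned above handles class imbalance by weighting false positives and false negatives separately. A minimal sketch of the standard formulation on flattened binary masks follows; the `alpha` and `beta` defaults are illustrative assumptions, since the abstract does not state the values used.

```python
def tversky_loss(pred, target, alpha=0.3, beta=0.7, eps=1e-7):
    """Tversky loss for flattened masks.

    pred   : predicted foreground probabilities in [0, 1]
    target : binary ground-truth labels (0 or 1)
    alpha  : weight on false positives (assumed default)
    beta   : weight on false negatives (assumed default)
    """
    tp = sum(p * t for p, t in zip(pred, target))
    fp = sum(p * (1 - t) for p, t in zip(pred, target))
    fn = sum((1 - p) * t for p, t in zip(pred, target))
    index = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    return 1.0 - index


# Toy example (made-up mask): one of two active cells is missed.
prediction = [1.0, 0.0, 0.0, 0.0]
ground_truth = [1, 1, 0, 0]
loss_recall_weighted = tversky_loss(prediction, ground_truth)        # beta > alpha
loss_dice_equivalent = tversky_loss(prediction, ground_truth, 0.5, 0.5)
```

Setting `alpha = beta = 0.5` recovers the Dice loss, while choosing `beta > alpha` penalizes missed source regions more heavily, which is useful when active regions occupy only a small part of the map.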

Abstract:
We introduce PointGauss, a novel point cloud‑guided framework for real‑time multi‑object segmentation in Gaussian Splatting representations. Unlike existing methods that suffer from prolonged initialization and limited multi‑view consistency, our approach achieves efficient 3D segmentation by directly parsing Gaussian primitives through a point cloud segmentation‑driven pipeline. The key innovation lies in two aspects: (1) a point cloud‑based Gaussian primitive decoder that generates 3D instance masks within 1 minute, and (2) a GPU‑accelerated 2D mask rendering system that ensures multi‑view consistency. Extensive experiments demonstrate significant improvements over previous state‑of‑the‑art methods, achieving performance gains of 1.89 to 31.78% in multi‑view mIoU, while maintaining superior computational efficiency. To address the limitations of current benchmarks (single‑object focus, inconsistent 3D evaluation, small scale, and partial coverage), we present DesktopObjects‑360, a novel comprehensive dataset for 3D segmentation in radiance fields, featuring: (1) complex multi‑object scenes, (2) globally consistent 2D annotations, (3) large‑scale training data (over 27 thousand 2D masks), (4) full 360° coverage, and (5) 3D evaluation masks.

Abstract:
Accurate mass estimation of table‑top grown strawberries under field conditions remains challenging due to frequent occlusions and pose variations. This study proposes a vision‑based pipeline integrating RGB‑D sensing and deep learning to enable non‑destructive, real‑time and online mass estimation. The method employed YOLOv8‑Seg for instance segmentation, Cycle‑consistent generative adversarial network (CycleGAN) for occluded region completion, and tilt‑angle correction to refine frontal projection area calculations. A polynomial regression model then mapped the geometric features to mass. Experiments demonstrated mean mass estimation errors of 8.11% for isolated strawberries and 10.47% for occluded cases. CycleGAN outperformed large mask inpainting (LaMa) model in occlusion recovery, achieving superior pixel area ratios (PAR) (mean: 0.978 vs. 1.112) and higher intersection over union (IoU) scores (92.3% vs. 47.7% in the [0.9‑1] range). This approach addresses critical limitations of traditional methods, offering a robust solution for automated harvesting and yield monitoring with complex occlusion patterns.

Abstract:
Unlike closed‑vocabulary 3D instance segmentation that is often trained end‑to‑end, open‑vocabulary 3D instance segmentation (OV‑3DIS) often leverages vision‑language models (VLMs) to generate 3D instance proposals and classify them. While various concepts have been proposed from existing research, we observe that these individual concepts are not mutually exclusive but complementary. In this paper, we propose a new state‑of‑the‑art solution for OV‑3DIS by carefully designing a recipe to combine the concepts together and refining them to address key challenges. Our solution follows the two‑stage scheme: 3D proposal generation and instance classification. We employ robust 3D tracking‑based proposal aggregation to generate 3D proposals and remove overlapped or partial proposals by iterative merging/removal. For the classification stage, we replace the standard CLIP model with Alpha‑CLIP, which incorporates object masks as an alpha channel to reduce background noise and obtain object‑centric representation. Additionally, we introduce the standardized maximum similarity (SMS) score to normalize text‑to‑proposal similarity, effectively filtering out false positives and boosting precision. Our framework achieves state‑of‑the‑art performance on ScanNet200 and S3DIS across all AP and AR metrics, even surpassing an end‑to‑end closed‑vocabulary method.

Abstract:
Video Object Segmentation and Tracking (VOST) presents a complex yet critical challenge in computer vision, requiring robust integration of segmentation and tracking across temporally dynamic frames. Traditional methods have struggled with domain generalization, temporal consistency, and computational efficiency. The emergence of foundation models like the Segment Anything Model (SAM) and its successor, SAM2, has introduced a paradigm shift, enabling prompt‑driven segmentation with strong generalization capabilities. Building upon these advances, this survey provides a comprehensive review of SAM/SAM2‑based methods for VOST, structured along three temporal dimensions: past, present, and future. We examine strategies for retaining and updating historical information (past), approaches for extracting and optimizing discriminative features from the current frame (present), and motion prediction and trajectory estimation mechanisms for anticipating object dynamics in subsequent frames (future). In doing so, we highlight the evolution from early memory‑based architectures to the streaming memory and real‑time segmentation capabilities of SAM2. We also discuss recent innovations such as motion‑aware memory selection and trajectory‑guided prompting, which aim to enhance both accuracy and efficiency. Finally, we identify remaining challenges including memory redundancy, error accumulation, and prompt inefficiency, and suggest promising directions for future research. This survey offers a timely and structured overview of the field, aiming to guide researchers and practitioners in advancing the state of VOST through the lens of foundation models.

Abstract:
Recently, large foundation models trained on vast datasets have demonstrated exceptional capabilities in feature extraction and general feature representation. The ongoing advancements in deep learning‑driven large models have shown great promise in accelerating unsupervised change detection methods, thereby enhancing the practical applicability of change detection technologies. Building on this progress, this paper introduces MergeSAM, an innovative unsupervised change detection method for high‑resolution remote sensing imagery, based on the Segment Anything Model (SAM). Two novel strategies, MaskMatching and MaskSplitting, are designed to address real‑world complexities such as object splitting, merging, and other intricate changes. The proposed method fully leverages SAM's object segmentation capabilities to construct multitemporal masks that capture complex changes, embedding the spatial structure of land cover into the change detection process.

Abstract:
The shape of a cell contains essential information about its function within the biological system. Segmenting these structures from large‑scale 3D microscopy images is challenging, limiting clinical insights especially for microglia, immune‑associated cells involved in neurodegenerative diseases. Existing segmentation methods mainly focus on cell bodies, struggle with overlapping structures, perform poorly on noisy images, require hyperparameter tuning for each new dataset, or rely on tedious semi‑automated approaches. We introduce trAIce3D, a deep‑learning architecture designed for precise microglia segmentation, capturing both somas and branches. It employs a two‑stage approach: first, a 3D U‑Net with vision transformers in the encoder detects somas using a sliding‑window technique to cover the entire image. Then, the same architecture, enhanced with cross‑attention blocks in skip connections, refines each soma and its branches by using soma coordinates as a prompt and a 3D window around the target cell as input. Training occurs in two phases: self‑supervised Soma Segmentation, followed by prompt‑based Branch Segmentation, leveraging pre‑trained weights from the first phase. Trained and evaluated on a dataset of 41,230 microglial cells, trAIce3D significantly improves segmentation accuracy and generalization, enabling scalable analysis of complex cellular morphologies. While optimized for microglia, its architecture can extend to other intricate cell types, such as neurons and astrocytes, broadening its impact on neurobiological research.

Abstract:
In this article, we present AlphaDent, a new dataset for dental research. The dataset is based on DSLR camera photographs of the teeth of 295 patients and contains over 1200 images. It is labeled for instance segmentation and is divided into 9 classes. The article provides a detailed description of the dataset and the labeling format, as well as the details of an experiment on training a neural network for instance segmentation using this dataset. The results obtained show high‑quality predictions. The dataset is published under an open license, and the training/inference code and model weights are also available under open licenses.

Abstract:
Rapid progress in terrain‑aware autonomous ground navigation has been driven by advances in supervised semantic segmentation. However, these methods rely on costly data collection and labor‑intensive ground truth labeling to train deep models. Furthermore, autonomous systems are increasingly deployed in unrehearsed, unstructured environments where no labeled data exists and semantic categories may be ambiguous or domain‑specific. Recent zero‑shot approaches to unsupervised segmentation have shown promise in such settings but typically operate on individual frames, lacking temporal consistency, a critical property for robust perception in unstructured environments. To address this gap, we introduce Frontier‑Seg, a method for temporally consistent unsupervised segmentation of terrain from mobile robot video streams. Frontier‑Seg clusters superpixel‑level features extracted from foundation model backbones (specifically DINOv2) and enforces temporal consistency across frames to identify persistent terrain boundaries, or frontiers, without human supervision. We evaluate Frontier‑Seg on a diverse set of benchmark datasets, including RUGD and RELLIS‑3D, demonstrating its ability to perform unsupervised segmentation across unstructured off‑road environments.

Abstract:
This work addresses motion‑guided few‑shot video object segmentation (FSVOS), which aims to segment dynamic objects in videos based on a few annotated examples with the same motion patterns. Existing FSVOS datasets and methods typically focus on object categories, which are static attributes that ignore the rich temporal dynamics in videos, limiting their application in scenarios requiring motion understanding. To fill this gap, we introduce MOVE, a large‑scale dataset specifically designed for motion‑guided FSVOS. Based on MOVE, we comprehensively evaluate 6 state‑of‑the‑art methods from 3 different related tasks across 2 experimental settings. Our results reveal that current methods struggle to address motion‑guided FSVOS, prompting us to analyze the associated challenges and propose a baseline method, Decoupled Motion Appearance Network (DMA). Experiments demonstrate that our approach achieves superior performance in few‑shot motion understanding, establishing a solid foundation for future research in this direction.

Abstract:
Event‑based semantic segmentation explores the potential of event cameras, which offer high dynamic range and fine temporal resolution, to achieve robust scene understanding in challenging environments. Despite these advantages, the task remains difficult due to two main challenges: extracting reliable features from sparse and noisy event streams, and effectively fusing them with dense, semantically rich image data that differ in structure and representation. To address these issues, we propose EIFNet, a multi‑modal fusion network that combines the strengths of both event and frame‑based inputs. The network includes an Adaptive Event Feature Refinement Module (AEFRM), which improves event representations through multi‑scale activity modeling and spatial attention. In addition, we introduce a Modality‑Adaptive Recalibration Module (MARM) and a Multi‑Head Attention Gated Fusion Module (MGFM), which align and integrate features across modalities using attention mechanisms and gated fusion strategies. Experiments on DDD17‑Semantic and DSEC‑Semantic datasets show that EIFNet achieves state‑of‑the‑art performance, demonstrating its effectiveness in event‑based semantic segmentation.

Abstract:
Scarcity of pixel‑level labels is a significant challenge in practical scenarios. In specific domains like industrial smoke, acquiring such detailed annotations is particularly difficult and often requires expert knowledge. To alleviate this, weakly supervised semantic segmentation (WSSS) has emerged as a promising approach. However, due to the supervision gap and inherent bias in models trained with only image‑level labels, existing WSSS methods suffer from limitations such as incomplete foreground coverage, inaccurate object boundaries, and spurious correlations, especially in our domain, where emissions are always spatially coupled with chimneys. Previous solutions typically rely on additional priors or external knowledge to mitigate these issues, but they often lack scalability and fail to address the model's inherent bias toward co‑occurring context. To address this, we propose a novel WSSS framework that directly targets the co‑occurrence problem without relying on external supervision. Unlike prior methods that adopt a single network, we employ a teacher‑student framework that combines CNNs and ViTs. We introduce a knowledge transfer loss that enforces cross‑architecture consistency by aligning internal representations. Additionally, we incorporate post‑processing techniques to address partial coverage and further improve pseudo‑mask quality.

Abstract:
Medical image segmentation requires not only accuracy but also robustness under challenging imaging conditions. In this study, we show that a carefully configured DeepLabv3 model can achieve high performance in segmenting induced pluripotent stem (iPS) cell colonies, and, under our experimental conditions, outperforms large‑scale foundation models such as SAM2 and its medical variant MedSAM2 without structural modifications. These results suggest that, for specialized tasks characterized by subtle, low‑contrast boundaries, increased model complexity does not necessarily translate to better performance. Our work revisits the assumption that ever‑larger and more generalized architectures are always preferable, and provides evidence that appropriately adapted, simpler models may offer strong accuracy and practical reliability in domain‑specific biomedical applications. We also offer an open‑source implementation that includes strategies for small datasets and domain‑specific encoding, with the aim of supporting further advances in semantic segmentation for regenerative medicine and related fields.

Abstract:
Unlike fully supervised semantic segmentation, weakly supervised semantic segmentation (WSSS) relies on weaker forms of supervision to perform dense prediction tasks. Among the various types of weak supervision, WSSS with image‑level annotations is considered both the most challenging and the most practical, attracting significant research attention. Therefore, in this review, we focus on WSSS with image‑level annotations. Additionally, this review concentrates on mainstream research directions, deliberately omitting less influential branches. Given the rapid development of new methods and the limitations of existing surveys in capturing recent trends, there is a pressing need for an updated and comprehensive review. Our goal is to fill this gap by synthesizing the latest advancements and state‑of‑the‑art techniques in WSSS with image‑level labels. Specifically, we categorize existing methods based on the types and levels of additional supervision involved. We also examine the challenges of applying advanced methods to domain‑specific datasets in WSSS, a topic that remains underexplored. Finally, we discuss the current challenges, evaluate the limitations of existing approaches, and outline several promising directions for future research. This review is intended for researchers who are already familiar with the fundamental concepts of WSSS and are seeking to deepen their understanding of current advances and methodological innovations.

Abstract:
This work focuses on multi‑floor indoor exploration, which remains an open area of research. Compared to traditional methods, recent learning‑based explorers have demonstrated significant potential due to their robust environmental learning and modeling capabilities, but most are restricted to 2D environments. In this paper, we propose a learning‑integrated topological explorer, LITE, for multi‑floor indoor environments. LITE decomposes the environment into a floor‑stair topology, enabling seamless integration of learning‑based or non‑learning‑based 2D exploration methods for 3D exploration. As we incrementally build the floor‑stair topology during exploration using a YOLO11‑based instance segmentation model, the agent can transition between floors through a finite state machine. Additionally, we implement an attention‑based 2D exploration policy that captures spatial dependencies between different regions, thereby determining the next global goal for more efficient exploration. Extensive comparison and ablation studies conducted on the HM3D and MP3D datasets demonstrate that our proposed 2D exploration policy significantly outperforms all baseline explorers in terms of exploration efficiency. Furthermore, experiments in several 3D multi‑floor environments indicate that our framework is compatible with various 2D exploration methods, facilitating effective multi‑floor indoor exploration. Finally, we validate our method in the real world with a quadruped robot, highlighting its strong generalization capabilities.
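
The abstract states only that a finite state machine governs floor transitions, so the states, inputs, and transition rules below are hypothetical, chosen to show what such a controller could look like. It is a minimal sketch, not LITE's actual design.

```python
from enum import Enum, auto


class State(Enum):
    # Hypothetical states; the abstract does not enumerate them.
    EXPLORE_FLOOR = auto()
    GO_TO_STAIR = auto()
    TRAVERSE_STAIR = auto()
    DONE = auto()


def step(state, floor_explored, unvisited_stairs, at_stair, on_new_floor):
    """Return the next state given boolean observations (all assumed inputs)."""
    if state is State.EXPLORE_FLOOR:
        if not floor_explored:
            return State.EXPLORE_FLOOR
        # Floor finished: head for a stair if one remains, else stop.
        return State.GO_TO_STAIR if unvisited_stairs else State.DONE
    if state is State.GO_TO_STAIR:
        return State.TRAVERSE_STAIR if at_stair else State.GO_TO_STAIR
    if state is State.TRAVERSE_STAIR:
        # Resume 2D exploration once the agent reaches the next floor.
        return State.EXPLORE_FLOOR if on_new_floor else State.TRAVERSE_STAIR
    return State.DONE


# Walk through one floor change.
s = State.EXPLORE_FLOOR
s = step(s, floor_explored=True, unvisited_stairs=True, at_stair=False, on_new_floor=False)
s = step(s, floor_explored=True, unvisited_stairs=True, at_stair=True, on_new_floor=False)
s = step(s, floor_explored=False, unvisited_stairs=False, at_stair=False, on_new_floor=True)
```

Keeping the floor-level policy behind the `EXPLORE_FLOOR` state is what would let any 2D explorer be plugged in unchanged, which matches the compatibility claim in the abstract.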

Abstract:
Domain Generalized Semantic Segmentation (DGSS) is a critical yet challenging task, as domain shifts in unseen environments can severely compromise model performance. While recent studies enhance feature alignment by projecting features into the source domain, they often neglect intrinsic latent domain priors, leading to suboptimal results. In this paper, we introduce PDAF, a Probabilistic Diffusion Alignment Framework that enhances the generalization of existing segmentation networks through probabilistic diffusion modeling. PDAF introduces a Latent Domain Prior (LDP) to capture domain shifts and uses this prior as a conditioning factor to align both source and unseen target domains. To achieve this, PDAF integrates into a pre‑trained segmentation model and utilizes paired source and pseudo‑target images to simulate latent domain shifts, enabling LDP modeling. The framework comprises three modules: the Latent Prior Extractor (LPE) predicts the LDP by supervising domain shifts; the Domain Compensation Module (DCM) adjusts feature representations to mitigate domain shifts; and the Diffusion Prior Estimator (DPE) leverages a diffusion process to estimate the LDP without requiring paired samples. This design enables PDAF to iteratively model domain shifts, progressively refining feature representations to enhance generalization under complex target conditions. Extensive experiments validate the effectiveness of PDAF across diverse and challenging urban scenes.

Abstract:
Accurate detection of defects such as hotspots and snail trails in photovoltaic modules is essential for maintaining energy efficiency and system reliability. This work presents a supervised deep learning framework for segmenting thermal infrared images of PV panels, using a dataset of 277 aerial thermographic images captured by a Zenmuse XT infrared camera mounted on a DJI Matrice 100 drone. The preprocessing pipeline includes image resizing, CLAHE‑based contrast enhancement, denoising, and normalisation. A lightweight semantic segmentation model based on SegFormer is developed, featuring a customised Transformer encoder and streamlined decoder, and fine‑tuned on annotated images with manually labeled defect regions. To evaluate performance, we benchmark our model against U‑Net, DeepLabV3, PSPNet, and Mask2Former using consistent preprocessing and augmentation. Evaluation metrics include per‑class Dice score, F1‑score, Cohen's kappa, mean IoU, and pixel accuracy. The SegFormer‑based model outperforms baselines in accuracy and efficiency, particularly for segmenting small and irregular defects. Its lightweight design enables real‑time deployment on edge devices and seamless integration with drone‑based systems for automated inspection of large‑scale solar farms.
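
Two of the listed metrics, the per-class Dice score and Cohen's kappa, can be computed from flattened label maps as sketched below. This uses the standard textbook definitions, not code from the paper, and the toy labels are made up. For a single class the Dice score coincides with the F1-score.

```python
from collections import Counter


def dice_score(pred, target, cls):
    """Dice = 2 * |P intersect T| / (|P| + |T|) for one class."""
    inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
    total = sum(1 for p in pred if p == cls) + sum(1 for t in target if t == cls)
    # Convention assumed here: a class absent from both maps scores 1.0.
    return 2 * inter / total if total else 1.0


def cohens_kappa(pred, target):
    """kappa = (p_o - p_e) / (1 - p_e): agreement corrected for chance."""
    n = len(pred)
    p_o = sum(1 for p, t in zip(pred, target) if p == t) / n
    pred_counts, target_counts = Counter(pred), Counter(target)
    labels = set(pred_counts) | set(target_counts)
    p_e = sum(pred_counts[c] * target_counts[c] for c in labels) / (n * n)
    return (p_o - p_e) / (1 - p_e) if p_e != 1 else 1.0


# Toy flattened masks: 1 = defect, 0 = background.
pred = [1, 1, 0, 0]
target = [1, 0, 0, 0]
dice_defect = dice_score(pred, target, cls=1)   # 2*1 / (2+1)
kappa = cohens_kappa(pred, target)              # (0.75 - 0.5) / 0.5
```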

Abstract:
MRI tumor segmentation remains a critical challenge in medical imaging, where volumetric analysis faces unique computational demands due to the complexity of 3D data. The spatially sequential arrangement of adjacent MRI slices provides valuable information that enhances segmentation continuity and accuracy, yet this characteristic remains underutilized in many existing models. The spatial correlations between adjacent MRI slices can be regarded as "temporal‑like" data, similar to frame sequences in video segmentation tasks. To bridge this gap, we propose M‑Net, a flexible framework specifically designed for sequential image segmentation. M‑Net introduces the novel Mesh‑Cast mechanism, which seamlessly integrates arbitrary sequential models into the processing of both channel and temporal information, thereby systematically capturing the inherent "temporal‑like" spatial correlations between MRI slices. Additionally, we define an MRI sequential input pattern and design a Two‑Phase Sequential (TPS) training strategy, which first focuses on learning common patterns across sequences before refining slice‑specific feature extraction. This approach leverages temporal modeling techniques to preserve volumetric contextual information while avoiding the high computational cost of full 3D convolutions, thereby enhancing the generalizability and robustness of M‑Net in sequential segmentation tasks. Experiments on the BraTS2019 and BraTS2023 datasets demonstrate that M‑Net outperforms existing methods across all key metrics, establishing itself as a robust solution for temporally‑aware MRI tumor segmentation.

Abstract:
Autonomous vehicles are the next revolution in the automobile industry and are expected to transform the future of transportation. Understanding the scenario in which an autonomous vehicle will operate is critical for its competent functioning. Deep learning has played a massive role in the progress made to date. Semantic segmentation, the process of annotating every pixel of an image with an object class, is one crucial part of this scene comprehension using deep learning. It is especially useful in autonomous driving research, as it requires comprehension of drivable and non‑drivable areas, roadside objects, and the like. In this paper, semantic segmentation has been performed on the Indian Driving Dataset, which has been recently compiled on the urban and rural roads of Bengaluru and Hyderabad. This dataset is more challenging than other datasets like Cityscapes, since it is based on unstructured driving environments. It has a four‑level hierarchy, and in this paper segmentation has been performed on the first level. Five different models have been trained, and their performance has been compared using the mean Intersection over Union (mIoU): UNET, UNET+RESNET50, DeepLabV3, PSPNet, and SegNet. The highest mIoU achieved is 0.6496. The paper discusses the dataset, exploratory data analysis, data preparation, and implementation of the five models, and compares the results achieved.
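
The comparison metric named above, mean Intersection over Union, averages the per-class ratio of intersection to union between predicted and ground-truth labels. A minimal sketch on flattened label lists follows; the toy labels are illustrative, and skipping classes absent from both maps is a common convention assumed here, not one stated in the abstract.

```python
def mean_iou(pred, target, num_classes):
    """Average per-class IoU over classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes that appear in neither map
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with three classes (made-up labels).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
# Per-class IoU: class 0 = 1/3, class 1 = 2/3, class 2 = 1/2.
score = mean_iou(pred, target, num_classes=3)
```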

Abstract:
We introduce an end‑to‑end pipeline that leverages a fine‑tuned YOLO12l‑seg model, trained on over 500 annotated post‑blast images, to deliver real‑time instance segmentation (Box mAP@0.5 ~ 0.769, Mask mAP@0.5 ~ 0.800 at ~ 15 FPS). High‑fidelity masks are converted into normalized 3D coordinates, from which we extract multi‑metric spatial descriptors: principal component directions, kernel density hotspots, size‑depth regression, and Delaunay edge statistics. We present four representative examples to illustrate key fragmentation patterns. Experimental results confirm the framework's accuracy, robustness to small‑object crowding, and feasibility for rapid, automated blast‑effect assessment in field conditions.

Abstract:
Online video segmentation methods excel at handling long sequences and capturing gradual changes, making them ideal for real‑world applications. However, achieving temporally consistent predictions remains a challenge, especially with the gradual accumulation of noise or drift in online propagation, abrupt occlusions, and scene transitions. This paper introduces Local2Global, an online framework for video instance segmentation that exhibits state‑of‑the‑art performance with a simple baseline and purely online training. Leveraging the DETR‑based query propagation framework, we introduce two novel sets of queries: (1) local queries that capture initial object‑specific spatial features from each frame and (2) global queries containing past spatio‑temporal representations. We propose the L2G‑aligner, a novel lightweight transformer decoder, to facilitate an early alignment between local and global queries. This alignment allows our model to effectively utilize current frame information while maintaining temporal consistency, producing a smooth transition between frames. Furthermore, the L2G‑aligner is integrated within the segmentation model without relying on additional complex heuristics or memory mechanisms. Extensive experiments across various challenging VIS and VPS datasets showcase the superiority of our method with simple online training, surpassing current benchmarks without bells and whistles. For instance, we achieve 54.3 and 49.4 AP on the YouTube‑VIS‑19/‑21 datasets and 37.0 AP on the OVIS dataset, respectively, with the ResNet‑50 backbone.

Abstract:
Despite significant advancements in computer vision, semantic segmentation models may be susceptible to backdoor attacks. These attacks, involving hidden triggers, aim to cause the models to misclassify instances of the victim class as the target class when triggers are present, posing serious threats to the reliability of these models. To further explore the field of backdoor attacks against semantic segmentation, in this paper, we propose a simple yet effective backdoor attack called Contextual Segmentation Backdoor Attack (ConSeg). ConSeg leverages the contextual information inherent in semantic segmentation models to enhance backdoor performance. Our method is motivated by an intriguing observation: when the target class is set as the 'co‑occurring' class of the victim class, the victim class can be more easily 'mis‑segmented'. Building upon this insight, ConSeg mimics the contextual information of the target class and rebuilds it in the victim region to establish the contextual relationship between the target class and the victim class, making the attack easier. Our experiments reveal that ConSeg improves the Attack Success Rate (ASR) by 15.55% compared to existing methods, while exhibiting resilience against state‑of‑the‑art backdoor defenses.

Abstract:
In the past, continual learning (CL) was mostly concerned with the problem of catastrophic forgetting in neural networks, that arises when incrementally learning a sequence of tasks. Current CL methods function within the confines of limited data access, without any restrictions imposed on computational resources. However, in real‑world scenarios, the latter takes precedence as deployed systems are often computationally constrained. A major drawback of most CL methods is the need to retrain the entire model for each new task. The computational demands of retraining large models can be prohibitive, limiting the applicability of CL in environments with limited resources. Through CLoRA, we explore the applicability of Low‑Rank Adaptation (LoRA), a parameter‑efficient fine‑tuning method for class‑incremental semantic segmentation. CLoRA leverages a small set of parameters of the model and uses the same set for learning across all tasks. Results demonstrate the efficacy of CLoRA, achieving performance on par with and exceeding the baseline methods. We further evaluate CLoRA using NetScore, underscoring the need to factor in resource efficiency and evaluate CL methods beyond task performance. CLoRA significantly reduces the hardware requirements for training, making it well‑suited for CL in resource‑constrained environments after deployment.
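
As background for the resource argument above, LoRA in its standard formulation keeps a weight matrix W (d_out × d_in) frozen and learns a low-rank update BA, with B of shape d_out × r and A of shape r × d_in, so only r·(d_in + d_out) parameters are trained per adapted matrix. The sketch below illustrates that general idea in plain Python; the dimensions, rank, and scale are illustrative assumptions and are not taken from CLoRA.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(x, W, A, B, scale=1.0):
    """y = W x + scale * B (A x); W stays frozen, only A and B are trained."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + scale * d for b, d in zip(base, delta)]


def lora_trainable_params(d_in, d_out, rank):
    """Number of parameters in the low-rank factors A and B."""
    return rank * (d_in + d_out)


# Tiny example: d_in = d_out = 2, rank = 1 (made-up values).
W = [[1, 0], [0, 1]]
A = [[1, 1]]      # shape r x d_in
B = [[1], [0]]    # shape d_out x r
y = lora_forward([2, 3], W, A, B)

# Illustrative size comparison for a 768 x 768 layer at rank 8.
full = 768 * 768
low_rank = lora_trainable_params(768, 768, rank=8)
```

In this illustrative setting the low-rank factors hold 12,288 parameters against 589,824 for the full matrix, roughly 2%, which is the kind of saving that makes retraining cheaper on constrained hardware.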

Abstract:
Federated domain generalization has shown promising progress in image classification by enabling collaborative training across multiple clients without sharing raw data. However, its potential in semantic segmentation for autonomous driving remains underexplored. In this paper, we propose FedS2R, the first one‑shot federated domain generalization framework for synthetic‑to‑real semantic segmentation in autonomous driving. FedS2R comprises two components: an inconsistency‑driven data augmentation strategy that generates images for unstable classes, and a multi‑client knowledge distillation scheme with feature fusion that distills a global model from multiple client models. Experiments on five real‑world datasets, Cityscapes, BDD100K, Mapillary, IDD, and ACDC, show that the global model significantly outperforms individual client models and is only 2 mIoU points behind the model trained with simultaneous access to all client data. These results demonstrate the effectiveness of FedS2R in synthetic‑to‑real semantic segmentation for autonomous driving under federated learning.

Abstract:
Entrusted with the goal of pixel‑level object classification, semantic segmentation networks entail the laborious preparation of pixel‑level annotation masks. To obtain pixel‑level annotation masks for a given class without human effort, a few recent works have proposed generating pairs of images and annotation masks by exploiting the image‑text relationships modeled by text‑to‑image generative models, especially Stable Diffusion. However, these works do not fully exploit the capability of text‑guided diffusion models and thus require a pre‑trained segmentation network, careful text prompt tuning, or the training of a segmentation network to generate final annotation masks. In this work, we take a closer look at the attention mechanisms of Stable Diffusion, from which we draw connections with classical seeded segmentation approaches. In particular, we show that cross‑attention alone provides very coarse object localization, which can nevertheless provide initial seeds. Then, akin to region expansion in seeded segmentation, we utilize the semantic‑correspondence‑modeling capability of self‑attention to iteratively spread the attention from the seeds to the whole class using multi‑scale self‑attention maps. We also observe that a synthetic image guided by a simple text prompt often has a uniform background, in which correspondences are easier to find than in complex‑structured objects. Thus, we further refine a mask using a more accurate background mask. Our proposed method, dubbed SeeDiff, generates high‑quality masks off‑the‑shelf from Stable Diffusion, without an additional training procedure, prompt tuning, or a pre‑trained segmentation network.

Abstract:
Unsupervised video object segmentation (VOS) aims to detect the most prominent object in a video. Recently, two‑stream approaches that leverage both RGB images and optical flow have gained significant attention, but their performance is fundamentally constrained by the scarcity of training data. To address this, we propose DepthFlow, a novel data generation method that synthesizes optical flow from single images. Our approach is driven by the key insight that VOS models depend more on structural information embedded in flow maps than on their geometric accuracy, and that this structure is highly correlated with depth. We first estimate a depth map from a source image and then convert it into a synthetic flow field that preserves essential structural cues. This process enables the transformation of large‑scale image‑mask pairs into image‑flow‑mask training pairs, dramatically expanding the data available for network training. By training a simple encoder‑decoder architecture with our synthesized data, we achieve new state‑of‑the‑art performance on all public VOS benchmarks, demonstrating a scalable and effective solution to the data scarcity problem.

Abstract:
Accurate perception and scene understanding in complex urban environments is a critical challenge for ensuring safe and efficient autonomous navigation. In this paper, we present Co‑Win, a novel bird's eye view (BEV) perception framework that integrates point cloud encoding with efficient parallel window‑based feature extraction to address the multi‑modality inherent in environmental understanding. Our method employs a hierarchical architecture comprising a specialized encoder, a window‑based backbone, and a query‑based decoder head to effectively capture diverse spatial features and object relationships. Unlike prior approaches that treat perception as a simple regression task, our framework incorporates a variational approach with mask‑based instance segmentation, enabling fine‑grained scene decomposition and understanding. The Co‑Win architecture processes point cloud data through progressive feature extraction stages, ensuring that predicted masks are both data‑consistent and contextually relevant. Furthermore, our method produces interpretable and diverse instance predictions, enabling enhanced downstream decision‑making and planning in autonomous driving systems.

Abstract:
Consistent surgical instrument segmentation is critical for automation in robot‑assisted surgery. Yet, existing methods only treat instrument‑level instance segmentation (IIS) or part‑level semantic segmentation (PSS) separately, without interaction between these tasks. In this work, we formulate a surgical tool segmentation as a unified part‑aware instance segmentation (PIS) problem and introduce SurgPIS, the first PIS model for surgical instruments. Our method adopts a transformer‑based mask classification approach and introduces part‑specific queries derived from instrument‑level object queries, explicitly linking parts to their parent instrument instances. In order to address the lack of large‑scale datasets with both instance‑ and part‑level labels, we propose a weakly‑supervised learning strategy for SurgPIS to learn from disjoint datasets labelled for either IIS or PSS purposes. During training, we aggregate our PIS predictions into IIS or PSS masks, thereby allowing us to compute a loss against partially labelled datasets. A student‑teacher approach is developed to maintain prediction consistency for missing PIS information in the partially labelled data, e.g., parts of the IIS labelled data. Extensive experiments across multiple datasets validate the effectiveness of SurgPIS, achieving state‑of‑the‑art performance in PIS as well as IIS, PSS, and instrument‑level semantic segmentation.

Abstract:
Given an object mask, Semi‑supervised Video Object Segmentation (SVOS) technique aims to track and segment the object across video frames, serving as a fundamental task in computer vision. Although recent memory‑based methods demonstrate potential, they often struggle with scenes involving occlusion, particularly in handling object interactions and high feature similarity. To address these issues and meet the real‑time processing requirements of downstream applications, in this paper, we propose a novel bOundary Amendment video object Segmentation method with Inherent Structure refinement, hereby named OASIS. Specifically, a lightweight structure refinement module is proposed to enhance segmentation accuracy. With the fusion of rough edge priors captured by the Canny filter and stored object features, the module can generate an object‑level structure map and refine the representations by highlighting boundary features. Evidential learning for uncertainty estimation is introduced to further address challenges in occluded regions. The proposed method, OASIS, maintains an efficient design, yet extensive experiments on challenging benchmarks demonstrate its superior performance and competitive inference speed compared to other state‑of‑the‑art methods, i.e., achieving the F values of 91.6 (vs. 89.7 on DAVIS‑17 validation set) and G values of 86.6 (vs. 86.2 on YouTubeVOS 2019 validation set) while maintaining a competitive speed of 48 FPS on DAVIS.

Abstract:
Video Object Segmentation (VOS) is foundational to numerous computer vision applications, including surveillance, autonomous driving, robotics, and generative video editing. However, existing VOS models often struggle with precise mask delineation, deformable objects, topologically transforming objects, tracking drift, and long video sequences. In this paper, we introduce HQ‑SMem (High Quality video segmentation and tracking using Smart Memory), a novel method that enhances the performance of VOS base models by addressing these limitations. Our approach incorporates three key innovations: (i) leveraging SAM with High‑Quality masks (SAM‑HQ) alongside appearance‑based candidate selection to refine coarse segmentation masks, resulting in improved object boundaries; (ii) implementing a dynamic smart memory mechanism that selectively stores relevant key frames while discarding redundant ones, thereby optimizing memory usage and processing efficiency for long‑term videos; and (iii) dynamically updating the appearance model to effectively handle complex topological object variations and reduce drift throughout the video. These contributions mitigate several limitations of existing VOS models, including coarse segmentations that mix in background pixels, fixed memory update schedules, brittleness to drift and occlusions, and prompt ambiguity issues associated with SAM. Extensive experiments conducted on multiple public datasets and state‑of‑the‑art base trackers demonstrate that our method consistently ranks among the top two on the VOTS and VOTSt 2024 datasets. Moreover, HQ‑SMem sets new benchmarks on the Long Video Dataset and LVOS, showcasing its effectiveness in challenging scenarios characterized by complex multi‑object dynamics over extended temporal durations.

Abstract:
The poultry industry has been driven by broiler chicken production and has grown into the world's largest animal protein sector. Automated detection of chicken carcasses on processing lines is vital for quality control, food safety, and operational efficiency in slaughterhouses and poultry processing plants. However, developing robust deep learning models for tasks like instance segmentation in these fast‑paced industrial environments is often hampered by the need for laborious acquisition and annotation of large‑scale real‑world image datasets. We present the first pipeline generating photo‑realistic, automatically labeled synthetic images of chicken carcasses. We also introduce a new benchmark dataset containing 300 annotated real‑world images, curated specifically for poultry segmentation research. Using these datasets, this study investigates the efficacy of synthetic data and automatic data annotation to enhance the instance segmentation of chicken carcasses, particularly when real annotated data from the processing line is scarce. A small real dataset with varying proportions of synthetic images was evaluated in prominent instance segmentation models. Results show that synthetic data significantly boosts segmentation performance for chicken carcasses across all models. This research underscores the value of synthetic data augmentation as a viable and effective strategy to mitigate data scarcity, reduce manual annotation efforts, and advance the development of robust AI‑driven automated detection systems for chicken carcasses in the poultry processing industry.

Abstract:
In this work, we address the problem of semantic object segmentation using foundation models. We investigate whether foundation models, trained on a large number and variety of objects, can perform object segmentation without fine‑tuning on specific images containing everyday objects, but in highly cluttered visual scenes. The "in the wild" context is driven by the target application of vision‑guided upper limb neuroprostheses. We propose a method for generating prompts based on gaze fixations to guide the Segment Anything Model (SAM) in our segmentation scenario, and fine‑tune it on egocentric visual data. Evaluation results of our approach show an improvement of the IoU segmentation quality metric by up to 0.51 points on challenging real‑world data from the Grasping‑in‑the‑Wild corpus, which is made available on the RoboFlow Platform (https://universe.roboflow.com/iwrist/grasping‑in‑the‑wild).

Abstract:
Acquiring and annotating large datasets in ultrasound imaging is challenging due to low contrast, high noise, and susceptibility to artefacts. This process requires significant time and clinical expertise. Self‑supervised learning (SSL) offers a promising solution by leveraging unlabelled data to learn useful representations, enabling improved segmentation performance when annotated data is limited. Recent state‑of‑the‑art developments in SSL for video data include V‑JEPA, a framework solely based on feature prediction, avoiding pixel level reconstruction or negative samples. We hypothesise that V‑JEPA is well‑suited to ultrasound imaging, as it is less sensitive to noisy pixel‑level detail while effectively leveraging temporal information. To the best of our knowledge, this is the first study to adopt V‑JEPA for ultrasound video data. Similar to other patch‑based masking SSL techniques such as VideoMAE, V‑JEPA is well‑suited to ViT‑based models. However, ViTs can underperform on small medical datasets due to lack of inductive biases, limited spatial locality and absence of hierarchical feature learning. To improve locality understanding, we propose a novel 3D localisation auxiliary task to improve locality in ViT representations during V‑JEPA pre‑training. Our results show V‑JEPA with our auxiliary task improves segmentation performance significantly across various frozen encoder configurations, with gains up to 3.4% using 100% and up to 8.35% using only 10% of the training data.

Abstract:
Aviation's climate impact includes not only CO2 emissions but also significant non‑CO2 effects, especially from contrails. These ice clouds can alter Earth's radiative balance, potentially rivaling the warming effect of aviation CO2. Physics‑based models provide useful estimates of contrail formation and climate impact, but their accuracy depends heavily on the quality of atmospheric input data and on assumptions used to represent complex processes like ice particle formation and humidity‑driven persistence. Observational data from remote sensors, such as satellites and ground cameras, could be used to validate and calibrate these models. However, existing datasets do not explore all aspects of contrail dynamics and formation: they typically lack temporal tracking, and do not attribute contrails to their source flights. To address these limitations, we present the Ground Visible Camera Contrail Sequences (GVCCS), a new open dataset of contrails recorded with a ground‑based all‑sky camera in the visible range. Each contrail is individually labeled and tracked over time, allowing a detailed analysis of its lifecycle. The dataset contains 122 video sequences (24,228 frames) and includes flight identifiers for contrails that form above the camera. As a reference, we also propose a unified deep learning framework for contrail analysis using a panoptic segmentation model that performs semantic segmentation (contrail pixel identification), instance segmentation (individual contrail separation), and temporal tracking in a single architecture. By providing high‑quality, temporally resolved annotations and a benchmark for model evaluation, our work supports improved contrail monitoring and will facilitate better calibration of physical models. This sets the groundwork for more accurate climate impact understanding and assessments.

Abstract:
Addressing performance degradation in 3D LiDAR semantic segmentation due to domain shifts (e.g., sensor type, geographical location) is crucial for autonomous systems, yet manual annotation of target data is prohibitive. This study addresses the challenge using Unsupervised Domain Adaptation (UDA) and introduces a novel two‑stage framework to tackle it. Initially, unsupervised contrastive learning at the segment level is used to pre‑train a backbone network, enabling it to learn robust, domain‑invariant features without labels. Subsequently, a multi‑model pseudo‑labeling strategy is introduced, utilizing an ensemble of diverse state‑of‑the‑art architectures (including projection, voxel, hybrid, and cylinder‑based methods). Predictions from these models are aggregated via hard voting to generate high‑quality, refined pseudo‑labels for the unlabeled target domain, mitigating single‑model biases. The contrastively pre‑trained network is then fine‑tuned using these robust pseudo‑labels. Experiments adapting from SemanticKITTI to unlabeled target datasets (SemanticPOSS, SemanticSlamantic) demonstrate significant improvements in segmentation accuracy compared to direct transfer and single‑model UDA approaches. These results highlight the effectiveness of combining contrastive pre‑training with refined ensemble pseudo‑labeling for bridging complex domain gaps without requiring target domain annotations.
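
The hard-voting aggregation mentioned in this abstract can be illustrated with a minimal sketch. It assumes each model emits one integer class label per point; the function name `hard_vote`, the `min_agreement` threshold, and the `IGNORE_LABEL` marker are hypothetical choices for illustration and are not taken from the paper.

```python
from collections import Counter

IGNORE_LABEL = -1  # hypothetical marker for points left without a pseudo-label


def hard_vote(predictions, min_agreement=0.5):
    """Fuse per-point labels from several models by majority (hard) voting.

    predictions: list of per-model label lists, all of the same length.
    min_agreement: assumed threshold; a point keeps its winning label only
    if strictly more than this fraction of the models voted for it.
    """
    if not predictions:
        raise ValueError("need predictions from at least one model")
    n_models = len(predictions)
    n_points = len(predictions[0])
    if any(len(p) != n_points for p in predictions):
        raise ValueError("all models must label the same points")

    fused = []
    for votes in zip(*predictions):
        label, count = Counter(votes).most_common(1)[0]
        fused.append(label if count / n_models > min_agreement else IGNORE_LABEL)
    return fused


# Three hypothetical models labeling four points.
model_a = [0, 1, 2, 2]
model_b = [0, 1, 1, 3]
model_c = [0, 2, 1, 4]
pseudo_labels = hard_vote([model_a, model_b, model_c])
print(pseudo_labels)
```

Points on which no label reaches a majority are marked as ignored here; whether the paper discards or resolves such points differently is not stated in the abstract.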

Abstract:
State‑of‑the‑art methods for semantic segmentation are based on deep neural networks trained on large‑scale labeled datasets. Acquiring such datasets would incur large annotation costs, especially for dense pixel‑level prediction tasks like semantic segmentation. We consider region‑based active learning as a strategy to reduce annotation costs while maintaining high performance. In this setting, batches of informative image regions instead of entire images are selected for labeling. Importantly, we argue that enforcing local spatial diversity is beneficial for active learning in this case, and we propose to incorporate spatial diversity along with the traditional active selection criterion, e.g., data sample uncertainty, in a unified optimization framework for region‑based active learning. We apply this framework to the Cityscapes and PASCAL VOC datasets and demonstrate that the inclusion of spatial diversity effectively improves the performance of uncertainty‑based and feature diversity‑based active learning methods. Our framework achieves 95% performance of fully supervised methods with only 5‑9% of the labeled pixels, outperforming all state‑of‑the‑art region‑based active learning methods for semantic segmentation.
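
To make the idea of combining uncertainty with local spatial diversity concrete, the following is a rough sketch only. It swaps the paper's unified optimization framework for a simple greedy heuristic: regions are picked in order of decreasing uncertainty, and a region is skipped if it lies too close to an already selected region in the same image. The region tuple layout, the `min_gap` parameter, and the function name are all assumptions made for illustration.

```python
def select_regions(regions, budget, min_gap=2):
    """Greedily pick uncertain regions while keeping them spatially spread out.

    regions: list of (image_id, row, col, uncertainty) tuples on a region grid.
    budget: maximum number of regions to send for labeling.
    min_gap: assumed minimum Chebyshev distance (in grid cells) between two
    selected regions of the same image.
    """
    chosen = []
    for image_id, row, col, score in sorted(regions, key=lambda r: r[3], reverse=True):
        if len(chosen) == budget:
            break
        too_close = any(
            img == image_id and max(abs(r - row), abs(c - col)) < min_gap
            for img, r, c, _ in chosen
        )
        if not too_close:
            chosen.append((image_id, row, col, score))
    return chosen


# Hypothetical candidate regions from two images "a" and "b".
candidates = [
    ("a", 0, 0, 0.90),
    ("a", 0, 1, 0.85),  # adjacent to the first region, so it is skipped
    ("a", 3, 3, 0.60),
    ("b", 0, 0, 0.80),
    ("b", 0, 1, 0.30),
]
selected = select_regions(candidates, budget=3)
print([(img, r, c) for img, r, c, _ in selected])
```

A pure uncertainty ranking would have spent two of the three labels on neighboring regions of image "a"; the distance constraint redirects one of them to a different part of the scene.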

Abstract:
The development of X‑Ray microscopy (XRM) technology has enabled non‑destructive inspection of semiconductor structures for defect identification. Deep learning is widely used as the state‑of‑the‑art approach to perform visual analysis tasks. However, deep learning based models require a large amount of annotated data to train. Such data can be time‑consuming and expensive to obtain, especially for dense prediction tasks like semantic segmentation. In this work, we explore active learning (AL) as a potential solution to alleviate the annotation burden. We identify two unique challenges when applying AL to semiconductor XRM scans: large domain shift and severe class imbalance. To address these challenges, we propose to perform contrastive pretraining on the unlabelled data to obtain the initialization weights for each AL cycle, and a rareness‑aware acquisition function that favors the selection of samples containing rare classes. We evaluate our method on a semiconductor dataset that is compiled from XRM scans of high bandwidth memory structures composed of logic and memory dies, and demonstrate that our method achieves state‑of‑the‑art performance.

Abstract:
In the field of food image processing, efficient semantic segmentation techniques are crucial for industrial applications. However, existing large‑scale Transformer‑based models (such as FoodSAM) face challenges in meeting practical deployment requirements due to their massive parameter counts and high computational resource demands. This paper introduces TUNable Adapter module (Swin‑TUNA), a Parameter Efficient Fine‑Tuning (PEFT) method that integrates multiscale trainable adapters into the Swin Transformer architecture, achieving high‑performance food image segmentation by updating only 4% of the parameters. The core innovation of Swin‑TUNA lies in its hierarchical feature adaptation mechanism: it designs separable convolutions in depth and dimensional mappings of varying scales to address the differences in features between shallow and deep networks, combined with a dynamic balancing strategy for task‑agnostic and task‑specific features. Experiments demonstrate that this method achieves mIoU of 50.56% and 74.94% on the FoodSeg103 and UECFoodPix Complete datasets, respectively, surpassing the fully parameterized FoodSAM model while reducing the parameter count by 98.7% (to only 8.13M). Furthermore, Swin‑TUNA exhibits faster convergence and stronger generalization capabilities in low‑data scenarios, providing an efficient solution for assembling lightweight food image.

Abstract:
The significant morphological and distributional variability among subcellular components poses a long‑standing challenge for learning‑based organelle segmentation models, significantly increasing the risk of biased feature learning. Existing methods often rely on single mapping relationships, overlooking feature diversity and thereby inducing biased training. Although the Segment Anything Model (SAM) provides rich feature representations, its application to subcellular scenarios is hindered by two key challenges: (1) The variability in subcellular morphology and distribution creates gaps in the label space, leading the model to learn spurious or biased features. (2) SAM focuses on global contextual understanding and often ignores fine‑grained spatial details, making it challenging to capture subtle structural alterations and cope with skewed data distributions. To address these challenges, we introduce ScSAM, a method that enhances feature robustness by fusing pre‑trained SAM with Masked Autoencoder (MAE)‑guided cellular prior knowledge to alleviate training bias from data imbalance. Specifically, we design a feature alignment and fusion module to align pre‑trained embeddings to the same feature space and efficiently combine different representations. Moreover, we present a cosine similarity matrix‑based class prompt encoder to activate class‑specific features to recognize subcellular categories. Extensive experiments on diverse subcellular image datasets demonstrate that ScSAM outperforms state‑of‑the‑art methods.

Abstract:
Transforming in‑situ transmission electron microscopy (TEM) imaging into a tool for spatially‑resolved operando characterization of solid‑state reactions requires automated, high‑precision semantic segmentation of dynamically evolving features. However, traditional deep learning methods for semantic segmentation often encounter limitations due to the scarcity of labeled data, visually ambiguous features of interest, and small‑object scenarios. To tackle these challenges, we introduce MultiTaskDeltaNet (MTDN), a novel deep learning architecture that creatively reconceptualizes the segmentation task as a change detection problem. By implementing a unique Siamese network with a U‑Net backbone and using paired images to capture feature changes, MTDN effectively utilizes minimal data to produce high‑quality segmentations. Furthermore, MTDN utilizes a multi‑task learning strategy to leverage correlations between physical features of interest. In an evaluation using data from in‑situ environmental TEM (ETEM) videos of filamentous carbon gasification, MTDN demonstrated a significant advantage over conventional segmentation models, particularly in accurately delineating fine structural features. Notably, MTDN achieved a 10.22% performance improvement over conventional segmentation models in predicting small and visually ambiguous physical features. This work bridges several key gaps between deep learning and practical TEM image analysis, advancing automated characterization of nanomaterials in complex experimental settings.

Abstract:
Conventional approaches to video segmentation are confined to predefined object categories and cannot identify out‑of‑vocabulary objects, let alone objects that are not identified explicitly but only referred to implicitly in complex text queries. This shortcoming limits the utility of video segmentation in complex and variable scenarios, where a closed set of object categories is difficult to define and where users may not know the exact object category that will appear in the video. Such scenarios can arise in operating room video analysis, where different health systems may use different workflows and instrumentation, requiring flexible solutions for video analysis. Reasoning segmentation (RS) now offers promise towards such a solution, enabling natural language text queries as interaction for identifying the object to segment. However, existing video RS formulations assume that target objects remain contextually relevant throughout entire video sequences. This assumption is inadequate for real‑world scenarios in which objects of interest appear, disappear or change relevance dynamically based on temporal context, such as surgical instruments that become relevant only during specific procedural phases or anatomical structures that gain importance at particular moments during surgery. Our first contribution is the introduction of temporally‑constrained video reasoning segmentation, a novel task formulation that requires models to implicitly infer when target objects become contextually relevant based on text queries that incorporate temporal reasoning. Since manual annotation of temporally‑constrained video RS datasets would be expensive and limit scalability, our second contribution is an innovative automated benchmark construction method. Finally, we present TCVideoRSBenchmark, a temporally‑constrained video RS dataset containing 52 samples using the videos from the MVOR dataset.

Abstract:
When preoperative planning for surgeries is conducted on the basis of medical images, artificial intelligence methods can support medical doctors during assessment. In this work, we consider medical guidelines for preoperative planning of the transcatheter aortic valve replacement (TAVR) and identify tasks that may be supported via semantic segmentation models by making relevant anatomical structures measurable in computed tomography scans. We first derive fine‑grained TAVR‑relevant pseudo‑labels from coarse‑grained anatomical information, in order to train segmentation models and quantify how well they are able to find these structures in the scans. Furthermore, we propose an adaptation to the loss function in training these segmentation models and through this achieve a +1.27% Dice increase in performance. Our fine‑grained TAVR‑relevant pseudo‑labels and the computed tomography scans we build upon are available at https://doi.org/10.5281/zenodo.16274176.

Authors: Tobias Rueckert, David Rauber, Raphaela Maerkl, Leonard Klausmann, Suemeyye R. Yildiran, Max Gutbrod, Danilo Weber Nunes, Alvaro Fernandez Moreno, Imanol Luengo, Danail Stoyanov, Nicolas Toussaint, Enki Cho, Hyeon Bae Kim, Oh Sung Choo, Ka Young Kim, Seong Tae Kim, Gonçalo Arantes, Kehan Song, Jianjun Zhu, Junchen Xiong, Tingyi Lin, Shunsuke Kikuchi, Hiroki Matsuzaki, Atsushi Kouno, João Renato Ribeiro Manesco, João Paulo Papa, Tae-Min Choi, Tae Kyeong Jeong, Juyoun Park, Oluwatosin Alabi, Meng Wei, Tom Vercauteren, Runzhi Wu, Mengya Xu, An Wang, Long Bai, Hongliang Ren, Amine Yamlahi, Jakob Hennighausen, Lena Maier-Hein, Satoshi Kondo, Satoshi Kasai, Kousuke Hirasawa, Shu Yang, Yihui Wang, Hao Chen, Santiago Rodríguez, Nicolás Aparicio, Leonardo Manrique, Juan Camilo Lyons, Olivia Hosie, Nicolás Ayobi, Pablo Arbeláez, Yiping Li, Yasmina Al Khalil, Sahar Nasirihaghighi, Stefanie Speidel, Daniel Rueckert, Hubertus Feussner, Dirk Wilhelm, Christoph Palm

Abstract:
Reliable recognition and localization of surgical instruments in endoscopic video recordings are foundational for a wide range of applications in computer‑ and robot‑assisted minimally invasive surgery (RAMIS), including surgical training, skill assessment, and autonomous assistance. However, robust performance under real‑world conditions remains a significant challenge. Incorporating surgical context ‑ such as the current procedural phase ‑ has emerged as a promising strategy to improve robustness and interpretability. To address these challenges, we organized the Surgical Procedure Phase, Keypoint, and Instrument Recognition (PhaKIR) sub‑challenge as part of the Endoscopic Vision (EndoVis) challenge at MICCAI 2024. We introduced a novel, multi‑center dataset comprising thirteen full‑length laparoscopic cholecystectomy videos collected from three distinct medical institutions, with unified annotations for three interrelated tasks: surgical phase recognition, instrument keypoint estimation, and instrument instance segmentation. Unlike existing datasets, ours enables joint investigation of instrument localization and procedural context within the same data while supporting the integration of temporal information across entire procedures. We report results and findings in accordance with the BIAS guidelines for biomedical image analysis challenges. The PhaKIR sub‑challenge advances the field by providing a unique benchmark for developing temporally aware, context‑driven methods in RAMIS and offers a high‑quality resource to support future research in surgical scene understanding.

Abstract:
Obtaining pixel‑level annotations in the medical domain is both expensive and time‑consuming, often requiring close collaboration between clinical experts and developers. Semi‑supervised medical image segmentation aims to leverage limited annotated data alongside abundant unlabeled data to achieve accurate segmentation. However, existing semi‑supervised methods often struggle to structure semantic distributions in the latent space due to noise introduced by pseudo‑labels. In this paper, we propose a novel diffusion‑based framework for semi‑supervised medical image segmentation. Our method introduces a constraint into the latent structure of semantic labels during the denoising diffusion process by enforcing prototype‑based contrastive consistency. Rather than explicitly delineating semantic boundaries, the model leverages class prototypes, i.e., centralized semantic representations in the latent space, as anchors. This strategy improves the robustness of dense predictions, particularly in the presence of noisy pseudo‑labels. We also introduce a new publicly available benchmark: Multi‑Object Segmentation in X‑ray Angiography Videos (MOSXAV), which provides detailed, manually annotated segmentation ground truth for multiple anatomical structures in X‑ray angiography videos. Extensive experiments on the EndoScapes2023 and MOSXAV datasets demonstrate that our method outperforms state‑of‑the‑art medical image segmentation approaches under the semi‑supervised learning setting. This work presents a robust and data‑efficient diffusion model that offers enhanced flexibility and strong potential for a wide range of clinical applications.

Abstract:
Semantic segmentation in remote sensing (RS) has advanced significantly with the incorporation of multi‑modal data, particularly the integration of RGB imagery and the Digital Surface Model (DSM), which provides complementary contextual and structural information about the ground object. However, integrating RGB and DSM often faces two major limitations: increased computational complexity due to architectural redundancy, and degraded segmentation performance caused by modality misalignment. These issues undermine the efficiency and robustness of semantic segmentation, particularly in complex urban environments where precise multi‑modal integration is essential. To overcome these limitations, we propose Asymmetric Multi‑Modal Network (AMMNet), a novel asymmetric architecture that achieves robust and efficient semantic segmentation through three designs tailored for RGB‑DSM input pairs. To reduce architectural redundancy, the Asymmetric Dual Encoder (ADE) module assigns representational capacity based on modality‑specific characteristics, employing a deeper encoder for RGB imagery to capture rich contextual information and a lightweight encoder for DSM to extract sparse structural features. Besides, to facilitate modality alignment, the Asymmetric Prior Fuser (APF) integrates a modality‑aware prior matrix into the fusion process, enabling the generation of structure‑aware contextual features. Additionally, the Distribution Alignment (DA) module enhances cross‑modal compatibility by aligning feature distributions through divergence minimization. Extensive experiments on the ISPRS Vaihingen and Potsdam datasets demonstrate that AMMNet attains state‑of‑the‑art segmentation accuracy among multi‑modal networks while reducing computational and memory requirements.

Abstract:
RGB‑based semantic segmentation has become a mainstream approach for visual perception and is widely applied in a variety of downstream tasks. However, existing methods typically rely on high‑resolution RGB inputs, which may expose sensitive visual content in privacy‑critical environments. Ultra‑low‑resolution RGB sensing suppresses sensitive information directly during image acquisition, making it an attractive privacy‑preserving alternative. Nevertheless, recovering semantic segmentation from ultra‑low‑resolution RGB inputs remains highly challenging due to severe visual degradation. In this work, we introduce a novel fully joint‑learning framework to mitigate the optimization conflicts exacerbated by visual degradation for ultra‑low‑resolution semantic segmentation. Experiments demonstrate that our method outperforms representative baselines in semantic segmentation performance and our ultra‑low‑resolution RGB input achieves a favorable trade‑off between privacy preservation and semantic segmentation performance. We deploy our privacy‑preserving semantic segmentation method in a real‑world robotic object‑goal navigation task, demonstrating successful downstream task execution even under severe visual degradation.

Abstract:
Pixel‑level vision tasks, such as semantic segmentation, require extensive and high‑quality annotated data, which is costly to obtain. Semi‑supervised semantic segmentation (SSSS) has emerged as a solution to alleviate the labeling burden by leveraging both labeled and unlabeled data through self‑training techniques. Meanwhile, the advent of foundational segmentation models pre‑trained on massive data has shown the potential to generalize across domains effectively. This work explores whether a foundational segmentation model can address label scarcity in the pixel‑level vision task as an annotator for unlabeled images. Specifically, we investigate the efficacy of using SEEM, a Segment Anything Model (SAM) variant fine‑tuned for textual input, to generate predictive masks for unlabeled data. To address the shortcomings of using SEEM‑generated masks as supervision, we propose ConformalSAM, a novel SSSS framework which first calibrates the foundation model using the target domain's labeled data and then filters out unreliable pixel labels of unlabeled data so that only high‑confidence labels are used as supervision. By leveraging conformal prediction (CP) to adapt foundation models to target data through uncertainty calibration, ConformalSAM reliably exploits the strong capability of the foundational segmentation model, which benefits the early‑stage learning, while a subsequent self‑reliance training strategy mitigates overfitting to SEEM‑generated masks in the later training stage. Our experiments demonstrate that, on three standard benchmarks of SSSS, ConformalSAM achieves superior performance compared to recent SSSS methods and helps boost the performance of those methods as a plug‑in.
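
The calibrate-then-filter idea can be sketched with textbook split conformal prediction. This is an assumption-laden illustration, not ConformalSAM's actual procedure: the nonconformity score `1 - p(true class)`, the rule of keeping only pixels whose prediction set contains exactly one class, and all names (`conformal_threshold`, `filter_pixels`, `IGNORE_LABEL`) are choices made here for the example.

```python
import math

IGNORE_LABEL = -1  # hypothetical marker for pixels excluded from supervision


def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split-conformal threshold on the scores 1 - p(true class)."""
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    if n == 0:
        raise ValueError("need at least one labeled calibration pixel")
    rank = math.ceil((n + 1) * (1.0 - alpha))
    # With too few calibration pixels the quantile is unbounded: keep every class.
    return scores[rank - 1] if rank <= n else float("inf")


def filter_pixels(pixel_probs, q_hat):
    """Keep a pseudo-label only where the conformal prediction set is a single class."""
    labels = []
    for probs in pixel_probs:
        pred_set = [c for c, p in enumerate(probs) if 1.0 - p <= q_hat]
        labels.append(pred_set[0] if len(pred_set) == 1 else IGNORE_LABEL)
    return labels


# Hypothetical class probabilities for nine labeled calibration pixels.
cal_probs = [
    [0.90, 0.05, 0.05], [0.10, 0.80, 0.10], [0.05, 0.00, 0.95],
    [0.70, 0.20, 0.10], [0.10, 0.85, 0.05], [0.30, 0.10, 0.60],
    [0.90, 0.10, 0.00], [0.20, 0.75, 0.05], [0.50, 0.25, 0.25],
]
cal_labels = [0, 1, 2, 0, 1, 2, 0, 1, 0]
q_hat = conformal_threshold(cal_probs, cal_labels, alpha=0.1)

# Hypothetical foundation-model outputs on four unlabeled pixels.
unlabeled = [[0.9, 0.05, 0.05], [0.5, 0.5, 0.0], [0.2, 0.3, 0.5], [0.4, 0.3, 0.3]]
kept = filter_pixels(unlabeled, q_hat)
print(q_hat, kept)
```

Pixels with an ambiguous (two-class) or empty prediction set are dropped from supervision, which is one simple way to realize "only high-confidence labels are used".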

Abstract:
In convolutional neural networks (CNNs), downsampling operations are crucial to model performance. Although traditional downsampling methods (such as maximum pooling and cross‑row convolution) perform well in feature aggregation, receptive field expansion, and computational reduction, they may lead to the loss of key spatial information in semantic segmentation tasks, thereby affecting the pixel‑by‑pixel prediction accuracy. To this end, this study proposes a downsampling method based on information complementarity, Hybrid Pooling Downsampling (HPD). The core is to replace the traditional method with MinMaxPooling, and effectively retain the light and dark contrast and detail features of the image by extracting the maximum value information of the local area. Experiments with various CNN architectures on the ACDC and Synapse datasets show that HPD outperforms traditional methods in segmentation performance, and increases the DSC by 0.5% on average. The results show that the HPD module provides an efficient solution for semantic segmentation tasks.
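
As a rough illustration of pooling that keeps both bright and dark extremes, the sketch below computes a max-pooled and a min-pooled map over non-overlapping windows of a single-channel feature map. The abstract does not say how HPD combines the two maps, so the function simply returns both; the name `min_max_pool2d` and the list-of-lists representation are assumptions for this example.

```python
def min_max_pool2d(feature_map, window=2):
    """Return (max_pooled, min_pooled) over non-overlapping window x window patches.

    feature_map: 2D list of numbers whose height and width are multiples of window.
    """
    height, width = len(feature_map), len(feature_map[0])
    if height % window or width % window:
        raise ValueError("height and width must be multiples of the window size")

    max_out, min_out = [], []
    for top in range(0, height, window):
        max_row, min_row = [], []
        for left in range(0, width, window):
            patch = [
                feature_map[r][c]
                for r in range(top, top + window)
                for c in range(left, left + window)
            ]
            max_row.append(max(patch))
            min_row.append(min(patch))
        max_out.append(max_row)
        min_out.append(min_row)
    return max_out, min_out


# Hypothetical 4x4 single-channel feature map.
features = [
    [1, 2, 5, 6],
    [3, 4, 7, 8],
    [9, 1, 0, 2],
    [5, 3, 4, 6],
]
max_pooled, min_pooled = min_max_pool2d(features)
print(max_pooled, min_pooled)
```

In a network, the two maps could for example be stacked along the channel dimension before the next layer; that design choice is a guess and would need to be checked against the paper.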

Abstract:
Agricultural parcels serve as basic units for conducting agricultural practices and applications, which is vital for land ownership registration, food security assessment, soil erosion monitoring, etc. However, existing agriculture parcel extraction studies only focus on mid‑resolution mapping or regular plain farmlands while lacking representation of complex terraced terrains due to the demands of precision agriculture. In this paper, we introduce a more fine‑grained terraced parcel dataset named GTPBD (Global Terraced Parcel and Boundary Dataset), which is the first fine‑grained dataset covering major worldwide terraced regions with more than 200,000 complex terraced parcels with manual annotation. GTPBD comprises 47,537 high‑resolution images with three‑level labels, including pixel‑level boundary labels, mask labels, and parcel labels. It covers seven major geographic zones in China and transcontinental climatic regions around the world. Compared to the existing datasets, the GTPBD dataset brings considerable challenges due to: (1) terrain diversity; (2) complex and irregular parcel objects; and (3) multiple domain styles. Our proposed GTPBD dataset is suitable for four different tasks, including semantic segmentation, edge detection, terraced parcel extraction, and unsupervised domain adaptation (UDA) tasks. Accordingly, we benchmark the GTPBD dataset on eight semantic segmentation methods, four edge extraction methods, three parcel extraction methods, and five UDA methods, along with a multi‑dimensional evaluation framework integrating pixel‑level and object‑level metrics. GTPBD fills a critical gap in terraced remote sensing research, providing a basic infrastructure for fine‑grained agricultural terrain analysis and cross‑scenario knowledge transfer.

Abstract:
Quantifying post‑consumer food waste in institutional dining settings is essential for supporting data‑driven sustainability strategies. This study presents a cost‑effective computer vision framework that estimates plate‑level food waste by utilizing semantic segmentation of RGB images taken before and after meal consumption across five Iranian dishes. Four fully supervised models (U‑Net, U‑Net++, and their lightweight variants) were trained using a capped dynamic inverse‑frequency loss and AdamW optimizer, then evaluated through a comprehensive set of metrics, including Pixel Accuracy, Dice, IoU, and a custom‑defined Distributional Pixel Agreement (DPA) metric tailored to the task. All models achieved satisfying performance, and for each food type, at least one model approached or surpassed 90% DPA, demonstrating strong alignment in pixel‑wise proportion estimates. Lighter models with reduced parameter counts offered faster inference, achieving real‑time throughput on an NVIDIA T4 GPU. Further analysis showed superior segmentation performance for dry and more rigid components (e.g., rice and fries), while more complex, fragmented, or viscous dishes, such as stews, showed reduced performance, specifically post‑consumption. Despite limitations such as reliance on 2D imaging, constrained food variety, and manual data collection, the proposed framework is pioneering and represents a scalable, contactless solution for continuous monitoring of food consumption. This research lays foundational groundwork for automated, real‑time waste tracking systems in large‑scale food service environments and offers actionable insights and outlines feasible future directions for dining hall management and policymakers aiming to reduce institutional food waste.
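
The abstract names a capped dynamic inverse-frequency loss without giving its formula. One plausible reading, sketched below purely as an assumption, is that class weights are recomputed for each batch as the inverse of the class's pixel frequency and then clipped at a maximum value so that very rare classes do not dominate the loss. The normalization by the number of classes, the `cap` value, and the zero weight for absent classes are all illustrative choices.

```python
from collections import Counter


def capped_inverse_frequency_weights(labels, num_classes, cap=10.0):
    """Per-class loss weights: inverse pixel frequency in this batch, clipped at `cap`.

    labels: flat list of integer class ids for the pixels of the current batch.
    Classes absent from the batch get weight 0.0 (an assumption of this sketch).
    """
    if not labels:
        raise ValueError("need at least one labeled pixel")
    counts = Counter(labels)
    total = len(labels)
    weights = []
    for class_id in range(num_classes):
        if counts[class_id] == 0:
            weights.append(0.0)
        else:
            weights.append(min(cap, total / (num_classes * counts[class_id])))
    return weights


# Hypothetical batch: 90 background pixels, 9 rice pixels, 1 stew pixel.
batch_labels = [0] * 90 + [1] * 9 + [2] * 1
weights = capped_inverse_frequency_weights(batch_labels, num_classes=3, cap=5.0)
print(weights)
```

Without the cap, the single stew pixel would receive a weight of about 33; clipping it to 5 keeps the rare class emphasized while limiting the influence of a handful of possibly noisy pixels.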

Abstract:
3D semantic segmentation provides high‑level scene understanding for applications in robotics, autonomous systems, etc. Traditional methods adapt exclusively to either task‑specific goals (open‑vocabulary segmentation) or scene content (unsupervised semantic segmentation). We propose DiSCO‑3D, the first method addressing the broader problem of 3D Open‑Vocabulary Sub‑concepts Discovery, which aims to provide a 3D semantic segmentation that adapts to both the scene and user queries. We build DiSCO‑3D on Neural Fields representations, combining unsupervised segmentation with weak open‑vocabulary guidance. Our evaluations demonstrate that DiSCO‑3D achieves effective performance in Open‑Vocabulary Sub‑concepts Discovery and exhibits state‑of‑the‑art results in the edge cases of both open‑vocabulary and unsupervised segmentation.

Abstract:
In recent years, artificial intelligence (AI) has become a prominent topic because of its promise in solving complex tasks. Human expertise in specific areas may no longer be required, because machines using AI have achieved successful results and can make the right decisions in critical situations. This is made possible by deep learning (DL), one of the most popular AI technologies. One important area in which DL is used effectively is the development of self‑driving cars. In this work, we propose several efficient models to investigate scene understanding through semantic segmentation. We use the BDD100k dataset to investigate these models. Another contribution of this work is the use of several backbones as encoders for the models. The obtained results show that choosing the appropriate backbone has a great effect on the performance of the model for semantic segmentation. Better performance in semantic segmentation allows us to better understand the scene and the environment around the agent. Finally, we analyze and evaluate the proposed models in terms of accuracy, mean IoU, and loss function, and the results show that these metrics are improved.

Abstract:
We propose a privacy‑preserving semantic‑segmentation method for applying perceptual encryption to images used for model training in addition to test images. This method also provides almost the same accuracy as models without any encryption. The above performance is achieved using a domain‑adaptation technique on the embedding structure of the Vision Transformer (ViT). The effectiveness of the proposed method was experimentally confirmed in terms of the accuracy of semantic segmentation when using a powerful semantic‑segmentation model with ViT called Segmentation Transformer.

Abstract:
Calisthenics is a fast‑growing bodyweight discipline that consists of different categories, one of which is focused on skills. Skills in calisthenics encompass both static and dynamic elements performed by athletes. The evaluation of static skills is based on their difficulty level and the duration of the hold. Automated tools able to recognize isometric skills from a video by segmenting them to estimate their duration would be desirable to assist athletes in their training and judges during competitions. Although the video understanding literature on action recognition through body pose analysis is rich, no previous work has specifically addressed the problem of calisthenics skill temporal video segmentation. This study aims to provide an initial step towards the implementation of automated tools within the field of Calisthenics. To advance knowledge in this context, we propose a dataset of video footage of static calisthenics skills performed by athletes. Each video is annotated with a temporal segmentation which determines the extent of each skill. We hence report the results of a baseline approach to address the problem of skill temporal segmentation on the proposed dataset. The results highlight the feasibility of the proposed problem, while there is still room for improvement.
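
As an illustration of what a temporal segmentation annotation looks like, the following minimal Python sketch turns per-frame skill labels into segments with hold durations. It is not the baseline from the paper; the label names (`"none"`, `"planche"`), the frame rate, and the function name are assumptions made for the example.

```python
from itertools import groupby

def frames_to_segments(frame_labels, fps):
    """Group consecutive identical frame labels into
    (label, start_seconds, end_seconds, duration_seconds) tuples."""
    segments = []
    start = 0
    for label, run in groupby(frame_labels):
        length = sum(1 for _ in run)  # number of frames in this run
        end = start + length
        segments.append((label, start / fps, end / fps, length / fps))
        start = end
    return segments

# Hypothetical per-frame predictions for a short clip sampled at 5 fps.
labels = ["none"] * 5 + ["planche"] * 10 + ["none"] * 5
segments = frames_to_segments(labels, fps=5)

# Keep only the segments that correspond to a held skill.
holds = [s for s in segments if s[0] != "none"]
print(holds)
```

In this toy clip the single hold spans frames 5 to 14, so its estimated duration is 2 seconds.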

Abstract:
Accurate mapping of individual trees is an important component for precision agriculture in orchards, as it allows autonomous robots to perform tasks like targeted operations or individual tree monitoring. However, creating these maps is challenging because GPS signals are often unreliable under dense tree canopies. Furthermore, standard Simultaneous Localization and Mapping (SLAM) approaches struggle in orchards because the repetitive appearance of trees can confuse the system, leading to mapping errors. To address this, we introduce Tree‑SLAM, a semantic SLAM approach tailored for creating maps of individual trees in orchards. Utilizing RGB‑D images, our method detects tree trunks with an instance segmentation model, estimates their location and re‑identifies them using a cascade‑graph‑based data association algorithm. These re‑identified trunks serve as landmarks in a factor graph framework that integrates noisy GPS signals, odometry, and trunk observations. The system produces maps of individual trees with a geo‑localization error as low as 18 cm, which is less than 20% of the planting distance. The proposed method was validated on diverse datasets from apple and pear orchards across different seasons, demonstrating high mapping accuracy and robustness in scenarios with unreliable GPS signals.

Abstract:
Public remote sensing datasets often face limitations in universality due to resolution variability and inconsistent land cover category definitions. To harness the vast pool of unlabeled remote sensing data, we propose SAMST, a semi‑supervised semantic segmentation method. SAMST leverages the strengths of the Segment Anything Model (SAM) in zero‑shot generalization and boundary detection. SAMST iteratively refines pseudo‑labels through two main components: supervised model self‑training using both labeled and pseudo‑labeled data, and a SAM‑based Pseudo‑label Refiner. The Pseudo‑label Refiner comprises three modules: a Threshold Filter Module for preprocessing, a Prompt Generation Module for extracting connected regions and generating prompts for SAM, and a Label Refinement Module for final label stitching. By integrating the generalization power of large models with the training efficiency of small models, SAMST improves pseudo‑label accuracy, thereby enhancing overall model performance. Experiments on the Potsdam dataset validate the effectiveness and feasibility of SAMST, demonstrating its potential to address the challenges posed by limited labeled data in remote sensing semantic segmentation.
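
The first two refiner modules described above (threshold filtering, then extracting connected regions to form prompts) can be sketched in plain Python as follows. This is only an assumed reading of the abstract: the threshold value, the use of 4-connectivity, the choice of bounding boxes as the prompt type, and all function names are hypothetical, and the call to SAM itself is omitted.

```python
from collections import deque

def threshold_filter(confidence, labels, tau):
    """Keep pseudo-labels with confidence >= tau; mark the rest as -1 (ignored)."""
    return [[lab if conf >= tau else -1 for conf, lab in zip(crow, lrow)]
            for crow, lrow in zip(confidence, labels)]

def connected_boxes(label_map):
    """Return (class_id, (row_min, col_min, row_max, col_max)) for every
    4-connected region of identical, non-ignored labels."""
    h, w = len(label_map), len(label_map[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for r in range(h):
        for c in range(w):
            cls = label_map[r][c]
            if cls < 0 or seen[r][c]:
                continue
            # Breadth-first search over the region containing (r, c).
            queue = deque([(r, c)])
            seen[r][c] = True
            r0, c0, r1, c1 = r, c, r, c
            while queue:
                y, x = queue.popleft()
                r0, c0, r1, c1 = min(r0, y), min(c0, x), max(r1, y), max(c1, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w and not seen[ny][nx]
                            and label_map[ny][nx] == cls):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((cls, (r0, c0, r1, c1)))
    return boxes

# Toy 3x3 pseudo-label map with per-pixel confidences (made-up values).
labels = [[1, 1, 0], [1, 0, 0], [2, 2, 0]]
confidence = [[0.9, 0.8, 0.9], [0.95, 0.9, 0.4], [0.3, 0.9, 0.9]]

filtered = threshold_filter(confidence, labels, tau=0.5)
boxes = connected_boxes(filtered)  # candidate box prompts, one per region
print(filtered)
print(boxes)
```

Each resulting box could then be passed to a promptable segmenter, and the returned masks stitched back into a refined label map.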

Abstract:
Generalizable semantic segmentation aims to perform well on unseen target domains, a critical challenge due to real‑world applications requiring high generalizability. Class‑wise prototypes, representing class centroids, serve as domain‑invariant cues that benefit generalization due to their stability and semantic consistency. However, this approach faces three challenges. First, existing methods often adopt coarse prototypical alignment strategies, which may hinder performance. Second, naive prototypes computed by averaging source batch features are prone to overfitting and may be negatively affected by unrelated source data. Third, most methods treat all source samples equally, ignoring the fact that different features have varying adaptation difficulties. To address these limitations, we propose a novel framework for generalizable semantic segmentation: Prototypical Progressive Alignment and Reweighting (PPAR), leveraging the strong generalization ability of the CLIP model. Specifically, we define two prototypes: the Original Text Prototype (OTP) and Visual Text Prototype (VTP), generated via CLIP to serve as a solid base for alignment. We then introduce a progressive alignment strategy that aligns features in an easy‑to‑difficult manner, reducing domain gaps gradually. Furthermore, we propose a prototypical reweighting mechanism that estimates the reliability of source data and adjusts its contribution, mitigating the effect of irrelevant or harmful features (i.e., reducing negative transfer). We also provide a theoretical analysis showing the alignment between our method and domain generalization theory. Extensive experiments across multiple benchmarks demonstrate that PPAR achieves state‑of‑the‑art performance, validating its effectiveness.

Abstract:
Spatial Transcriptomics (ST) technologies provide biologists with rich insights into single‑cell biology by preserving spatial context of cells. Building foundational models for ST can significantly enhance the analysis of vast and complex data sources, unlocking new perspectives on the intricacies of biological tissues. However, modeling ST data is inherently challenging due to the need to extract multi‑scale information from tissue slices containing vast numbers of cells. This process requires integrating macro‑scale tissue morphology, micro‑scale cellular microenvironment, and gene‑scale gene expression profile. To address this challenge, we propose SToFM, a multi‑scale Spatial Transcriptomics Foundation Model. SToFM first performs multi‑scale information extraction on each ST slice, to construct a set of ST sub‑slices that aggregate macro‑, micro‑ and gene‑scale information. Then an SE(2) Transformer is used to obtain high‑quality cell representations from the sub‑slices. Additionally, we construct SToCorpus‑88M, the largest high‑resolution spatial transcriptomics corpus for pretraining. SToFM achieves outstanding performance on a variety of downstream tasks, such as tissue region semantic segmentation and cell type annotation, demonstrating its comprehensive understanding of ST data through capturing and integrating multi‑scale information.

Abstract:
Observer bias and inconsistencies in traditional plant phenotyping methods limit the accuracy and reproducibility of fine‑grained plant analysis. To overcome these challenges, we developed TomatoMAP, a comprehensive dataset for Solanum lycopersicum using an Internet of Things (IoT) based imaging system with standardized data acquisition protocols. Our dataset contains 64,464 RGB images that capture 12 different plant poses from four camera elevation angles. Each image includes manually annotated bounding boxes for seven regions of interest (ROIs), including leaves, panicle, batch of flowers, batch of fruits, axillary shoot, shoot and whole plant area, along with 50 fine‑grained growth stage classifications based on the BBCH scale. Additionally, we provide a subset of 3,616 high‑resolution images with pixel‑wise semantic and instance segmentation annotations for fine‑grained phenotyping. We validated our dataset using a cascading model deep learning framework combining MobileNetv3 for classification, YOLOv11 for object detection, and MaskRCNN for segmentation. Through AI vs. Human analysis involving five domain experts, we demonstrate that the models trained on our dataset achieve accuracy and speed comparable to the experts. Cohen's Kappa and inter‑rater agreement heatmap confirm the reliability of automated fine‑grained phenotyping using our approach.
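
For readers unfamiliar with the agreement statistic mentioned above, here is a small Python sketch of the standard Cohen's Kappa formula for two raters. The label values are invented for illustration and are not taken from TomatoMAP; the function name is likewise an assumption.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa: (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement and p_e the agreement expected by chance.
    Undefined (division by zero) when p_e equals 1."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    categories = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical growth-stage labels from a model and from one expert.
model_labels = ["x", "x", "y", "y"]
expert_labels = ["x", "x", "y", "x"]
kappa = cohens_kappa(model_labels, expert_labels)
print(kappa)
```

Here the raters agree on 3 of 4 items (p_o = 0.75) while chance agreement is 0.5, giving a Kappa of 0.5.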

Abstract:
While open‑vocabulary semantic segmentation (OVSS) can segment an image into semantic regions based on arbitrarily given text descriptions even for classes unseen during training, it fails to understand personal texts (e.g., 'my mug cup') for segmenting regions of specific interest to users. This paper addresses challenges like recognizing 'my mug cup' among 'multiple mug cups'. To overcome this challenge, we introduce a novel task termed personalized open‑vocabulary semantic segmentation and propose a text prompt tuning‑based plug‑in method designed to recognize personal visual concepts using a few pairs of images and masks, while maintaining the performance of the original OVSS. Based on the observation that reducing false predictions is essential when applying text prompt tuning to this task, our proposed method employs 'negative mask proposal' that captures visual concepts other than the personalized concept. We further improve the performance by enriching the representation of text prompts by injecting visual embeddings of the personal concept into them. This approach enhances personalized OVSS without compromising the original OVSS performance. We demonstrate the superiority of our method on our newly established benchmarks for this task, including FSS$^\text{per}$, CUB$^\text{per}$, and ADE$^\text{per}$.

Abstract:
Semantic change detection (SCD) extends the binary change detection task to provide not only the change locations but also the detailed "from‑to" categories in multi‑temporal remote sensing data. Such detailed semantic insights into changes offer considerable advantages for a wide array of applications. However, since SCD involves the simultaneous optimization of multiple tasks, the model is prone to negative transfer due to task‑specific learning difficulties and conflicting gradient flows. To address this issue, we propose Graph Aggregation Prototype Learning for Semantic Change Detection in remote sensing (GAPL‑SCD). In this framework, a multi‑task joint optimization method is designed to optimize the primary tasks of semantic segmentation and change detection, along with the auxiliary task of graph aggregation prototype learning. Adaptive weight allocation and gradient rotation methods are used to alleviate the conflict between training tasks and improve multi‑task learning capabilities. Specifically, the graph aggregation prototype learning module constructs an interaction graph using high‑level features. Prototypes serve as class proxies, enabling category‑level domain alignment across time points and reducing interference from irrelevant changes. Additionally, the proposed self‑query multi‑level feature interaction and bi‑temporal feature fusion modules further enhance multi‑scale feature representation, improving performance in complex scenes. Experimental results on the SECOND and Landsat‑SCD datasets demonstrate that our method achieves state‑of‑the‑art performance, with significant improvements in accuracy and robustness for the SCD task.

Abstract:
Visual neuroprostheses (bionic eye) aim to restore a rudimentary form of vision by translating camera input into patterns of electrical stimulation. To improve scene understanding under extreme resolution and bandwidth constraints, prior work has explored computer vision techniques such as semantic segmentation and depth estimation. However, presenting all task‑relevant information simultaneously can overwhelm users in cluttered environments. We compare two complementary approaches to semantic preprocessing in immersive virtual reality: SemanticEdges, which highlights all relevant objects at once, and SemanticRaster, which staggers object categories over time to reduce visual clutter. Using a biologically grounded simulation of prosthetic vision, 18 sighted participants performed a wayfinding task in a dynamic urban environment across three conditions: edge‑based baseline (Control), SemanticEdges, and SemanticRaster. Both semantic strategies improved performance and user experience relative to the baseline, with each offering distinct trade‑offs: SemanticEdges increased the odds of success, while SemanticRaster boosted the likelihood of collision‑free completions. These findings underscore the value of adaptive semantic preprocessing for prosthetic vision and, more broadly, may inform the design of low‑bandwidth visual interfaces in XR that must balance information density, task relevance, and perceptual clarity.

Abstract:
Successful execution of dexterous robotic manipulation tasks in new environments, such as grasping, depends on the ability to proficiently segment unseen objects from the background and other objects. Previous works in unseen object instance segmentation (UOIS) train models on large‑scale datasets, which often leads to overfitting on static visual features. This dependency results in poor generalization performance when confronted with out‑of‑distribution scenarios. To address this limitation, we rethink the task of UOIS based on the principle that vision is inherently interactive and occurs over time. We propose a novel real‑time interactive perception framework, rt‑RISeg, that continuously segments unseen objects by robot interactions and analysis of a designed body frame‑invariant feature (BFIF). We demonstrate that the relative rotational and linear velocities of randomly sampled body frames, resulting from selected robot interactions, can be used to identify objects without any learned segmentation model. This fully self‑contained segmentation pipeline generates and updates object segmentation masks throughout each robot interaction without the need to wait for an action to finish. We showcase the effectiveness of our proposed interactive perception method by achieving an average object segmentation accuracy rate 27.5% greater than state‑of‑the‑art UOIS methods. Furthermore, although rt‑RISeg is a standalone framework, we show that the autonomously generated segmentation masks can be used as prompts to vision foundation models for significantly improved performance.

Abstract:
We introduce FGSSNet, a novel multi‑headed feature‑guided semantic segmentation (FGSS) architecture designed to improve the generalization ability of wall segmentation on floorplans. FGSSNet features a U‑Net segmentation backbone with a multi‑headed dedicated feature extractor used to extract domain‑specific feature maps which are injected into the latent space of U‑Net to guide the segmentation process. This dedicated feature extractor is trained as an encoder‑decoder with selected wall patches, representative of the walls present in the input floorplan, to produce a compressed latent representation of wall patches while jointly trained to predict the wall width. In doing so, we expect that the feature extractor encodes texture and width features of wall patches that are useful to guide the wall segmentation process. Our experiments show increased performance by the use of such injected features in comparison to the vanilla U‑Net, highlighting the validity of the proposed approach.

Abstract:
Recent research has investigated the shape and texture biases of deep neural networks (DNNs) in image classification which influence their generalization capabilities and robustness. It has been shown that, in comparison to regular DNN training, training with stylized images reduces texture biases in image classification and improves robustness with respect to image corruptions. In an effort to advance this line of research, we examine whether style transfer can likewise deliver these two effects in semantic segmentation. To this end, we perform style transfer with style varying across artificial image areas. Those random areas are formed by a chosen number of Voronoi cells. The resulting style‑transferred data is then used to train semantic segmentation DNNs with the objective of reducing their dependence on texture cues while enhancing their reliance on shape‑based features. In our experiments, it turns out that in semantic segmentation, style transfer augmentation reduces texture bias and strongly increases robustness with respect to common image corruptions as well as adversarial attacks. These observations hold for convolutional neural networks and transformer architectures on the Cityscapes dataset as well as on PASCAL Context, showing the generality of the proposed method.
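
The idea of random image areas formed by Voronoi cells, each receiving its own style, can be sketched as below. This is a plain-Python illustration under assumptions: seed points are drawn uniformly, distance is Euclidean, and each cell is assigned one style index at random; the function and parameter names are hypothetical, and the style transfer itself is not shown.

```python
import random

def voronoi_style_map(height, width, num_cells, num_styles, seed=0):
    """Assign every pixel the style index of its nearest Voronoi seed point."""
    rng = random.Random(seed)
    seeds = [(rng.randrange(height), rng.randrange(width)) for _ in range(num_cells)]
    cell_style = [rng.randrange(num_styles) for _ in range(num_cells)]
    style_map = []
    for y in range(height):
        row = []
        for x in range(width):
            # Index of the closest seed under squared Euclidean distance.
            nearest = min(
                range(num_cells),
                key=lambda i: (y - seeds[i][0]) ** 2 + (x - seeds[i][1]) ** 2,
            )
            row.append(cell_style[nearest])
        style_map.append(row)
    return style_map, seeds, cell_style

style_map, seeds, cell_style = voronoi_style_map(8, 12, num_cells=4, num_styles=3)
for row in style_map:
    print("".join(str(s) for s in row))
```

A training image could then be composed by taking, for each pixel, the output of the style-transferred version indexed by `style_map`, while the segmentation labels stay unchanged.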

Abstract:
We present Spatial Lifting (SL), a novel methodology for dense prediction tasks. SL operates by lifting standard inputs, such as 2D images, into a higher‑dimensional space and subsequently processing them using networks designed for that higher dimension, such as a 3D U‑Net. Counterintuitively, this dimensionality lifting allows us to achieve good performance on benchmark tasks compared to conventional approaches, while reducing inference costs and significantly lowering the number of model parameters. The SL framework produces intrinsically structured outputs along the lifted dimension. This emergent structure facilitates dense supervision during training and enables robust, near‑zero‑additional‑cost prediction quality assessment at test time. We validate our approach across 19 benchmark datasets (13 for semantic segmentation and 6 for depth estimation), demonstrating competitive dense prediction performance while reducing the model parameter count by over 98% (in the U‑Net case) and lowering inference costs. Spatial Lifting introduces a new vision modeling paradigm that offers a promising path toward more efficient, accurate, and reliable deep networks for dense prediction tasks in vision.

Abstract:
Deploying transformer‑based neural networks on resource‑constrained edge devices presents a significant challenge. This challenge is often addressed through various techniques, such as low‑rank approximation and mixed‑precision quantization. In this work, we introduce Mixed Low‑Rank and Quantization (MLoRQ), a novel method that integrates both techniques. MLoRQ employs a two‑stage optimization process to determine optimal bit‑width and rank assignments for each layer, adhering to predefined memory constraints. This process includes: (i) an intra‑layer optimization that identifies potentially optimal compression solutions out of all low‑rank and quantization combinations; (ii) an inter‑layer optimization that assigns bit‑width precision and rank to each layer while ensuring the memory constraint is met. An optional final step applies a sequential optimization process using a modified adaptive rounding technique to mitigate compression‑induced errors in joint low‑rank approximation and quantization. The method is compatible and can be seamlessly integrated with most existing quantization algorithms. MLoRQ shows state‑of‑the‑art results with up to 15% performance improvement, evaluated on Vision Transformers for image classification, object detection, and instance segmentation tasks.
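
To make the trade-off between rank and bit-width concrete, the sketch below computes the storage of one weight matrix under each option and keeps only the candidates that are not dominated in memory and error. It is a simplified illustration, not the MLoRQ procedure: the error values are made-up placeholders, and the function names and the Pareto-style filter are assumptions about what "potentially optimal" could mean.

```python
def layer_bits(rows, cols, bits, rank=None):
    """Storage in bits for a rows x cols weight matrix, either dense
    (rank=None) or factored as (rows x rank) times (rank x cols)."""
    params = rows * cols if rank is None else rank * (rows + cols)
    return params * bits

def pareto_front(candidates):
    """Keep candidates for which no other candidate is at least as good in
    both memory and error and strictly better in one of them."""
    front = []
    for c in candidates:
        dominated = any(
            o["mem"] <= c["mem"] and o["err"] <= c["err"]
            and (o["mem"] < c["mem"] or o["err"] < c["err"])
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c["mem"])

# One hypothetical 768 x 768 layer; "err" values are invented for the example.
candidates = [
    {"name": "dense-8bit", "mem": layer_bits(768, 768, 8), "err": 0.01},
    {"name": "dense-4bit", "mem": layer_bits(768, 768, 4), "err": 0.05},
    {"name": "rank128-8bit", "mem": layer_bits(768, 768, 8, rank=128), "err": 0.08},
    {"name": "rank384-8bit", "mem": layer_bits(768, 768, 8, rank=384), "err": 0.03},
]
front = pareto_front(candidates)
print([c["name"] for c in front])
```

Note that a rank of 384 stores exactly as many parameters as the dense 768 x 768 matrix, so with these toy errors it is dominated by the dense 8-bit option; an inter-layer step would then pick one surviving candidate per layer so that the total stays within the memory budget.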

Abstract:
Surgical video segmentation is a critical task in computer‑assisted surgery, essential for enhancing surgical quality and patient outcomes. Recently, the Segment Anything Model 2 (SAM2) framework has demonstrated remarkable advancements in both image and video segmentation. However, the inherent limitations of SAM2's greedy selection memory design are amplified by the unique properties of surgical videos (rapid instrument movement, frequent occlusion, and complex instrument‑tissue interaction), resulting in diminished performance in the segmentation of complex, long videos. To address these challenges, we introduce Memory Augmented (MA)‑SAM2, a training‑free video object segmentation strategy, featuring novel context‑aware and occlusion‑resilient memory models. MA‑SAM2 exhibits strong robustness against occlusions and interactions arising from complex instrument movements while maintaining accuracy in segmenting objects throughout videos. Employing a multi‑target, single‑loop, one‑prompt inference further enhances the efficiency of the tracking process in multi‑instrument videos. Without introducing any additional parameters or requiring further training, MA‑SAM2 achieved performance improvements of 4.36% and 6.1% over SAM2 on the EndoVis2017 and EndoVis2018 datasets, respectively, demonstrating its potential for practical surgical applications.

Abstract:
We propose SegVec3D, a novel framework for 3D point cloud instance segmentation that integrates attention mechanisms, embedding learning, and cross‑modal alignment. The approach builds a hierarchical feature extractor to enhance geometric structure modeling and enables unsupervised instance segmentation via contrastive clustering. It further aligns 3D data with natural language queries in a shared semantic space, supporting zero‑shot retrieval. Compared to recent methods like Mask3D and ULIP, our method uniquely unifies instance segmentation and multimodal understanding with minimal supervision and practical deployability.

Abstract:
High‑definition (HD) semantic mapping of complex intersections poses significant challenges for traditional vehicle‑based approaches due to occlusions and limited perspectives. This paper introduces a novel camera‑LiDAR fusion framework that leverages elevated intelligent roadside units (IRUs). Additionally, we present RS‑seq, a comprehensive dataset developed through the systematic enhancement and annotation of the V2X‑Seq dataset. RS‑seq includes precisely labelled camera imagery and LiDAR point clouds collected from roadside installations, along with vectorized maps for seven intersections annotated with detailed features such as lane dividers, pedestrian crossings, and stop lines. This dataset facilitates the systematic investigation of cross‑modal complementarity for HD map generation using IRU data. The proposed fusion framework employs a two‑stage process that integrates modality‑specific feature extraction and cross‑modal semantic integration, capitalizing on high‑resolution texture from cameras and precise geometric data from LiDAR. Quantitative evaluations using the RS‑seq dataset demonstrate that our multimodal approach consistently surpasses unimodal methods. Specifically, compared to unimodal baselines evaluated on the RS‑seq dataset, the multimodal approach improves the mean Intersection‑over‑Union (mIoU) for semantic segmentation by 4% over the image‑only results and 18% over the point cloud‑only results. This study establishes a baseline methodology for IRU‑based HD semantic mapping and provides a valuable dataset for future research in infrastructure‑assisted autonomous driving systems.

Abstract:
Semantic segmentation relies on many dense pixel‑wise annotations to achieve the best performance, but owing to the difficulty of obtaining accurate annotations for real world data, practitioners train on large‑scale synthetic datasets. Unpaired image translation is one method used to address the ensuing domain gap by generating more realistic training data in low‑data regimes. Current methods for unpaired image translation train generative adversarial networks (GANs) to perform the translation and enforce pixel‑level semantic matching through cycle consistency. These methods do not guarantee that the semantic matching holds, posing a problem for semantic segmentation where performance is sensitive to noisy pixel labels. We propose a novel image translation method, Domain Adversarial Kernel Prediction Network (DA‑KPN), that guarantees semantic matching between the synthetic label and translation. DA‑KPN estimates pixel‑wise input transformation parameters of a lightweight and simple translation function. To ensure the pixel‑wise transformation is realistic, DA‑KPN uses multi‑scale discriminators to distinguish between translated and target samples. We show DA‑KPN outperforms previous GAN‑based methods on syn2real benchmarks for semantic segmentation with limited access to real image labels and achieves comparable performance on face parsing.

Abstract:
Segment Anything Model 2 (SAM 2) has demonstrated strong performance in object segmentation tasks and has become the state‑of‑the‑art for visual object tracking. The model stores information from previous frames in a memory bank, enabling temporal consistency across video sequences. Recent methods augment SAM 2 with hand‑crafted update rules to better handle distractors, occlusions, and object motion. We propose a fundamentally different approach using reinforcement learning for optimizing memory updates in SAM 2 by framing memory control as a sequential decision‑making problem. In an overfitting setup with a separate agent per video, our method achieves a relative improvement over SAM 2 that is more than three times the gains of existing heuristics. These results reveal the untapped potential of the memory bank and highlight reinforcement learning as a powerful alternative to hand‑crafted update rules for memory control in visual object tracking.

Abstract:
Low‑level enhancement and high‑level visual understanding in low‑light vision have traditionally been treated separately. Low‑light enhancement improves image quality for downstream tasks, but existing methods rely on physical or geometric priors, limiting generalization. Evaluation mainly focuses on visual quality rather than downstream performance. Low‑light visual understanding, constrained by scarce labeled data, primarily uses task‑specific domain adaptation, which lacks scalability. To address these challenges, we build a generalized bridge between low‑light enhancement and low‑light understanding, which we term Generalized Enhancement For Understanding (GEFU). This paradigm improves both generalization and scalability. To address the diverse causes of low‑light degradation, we leverage pretrained generative diffusion models to optimize images, achieving zero‑shot generalization performance. Building on this, we propose Semantically Consistent Unsupervised Fine‑tuning (SCUF). Specifically, to overcome text prompt limitations, we introduce an illumination‑aware image prompt to explicitly guide image generation and propose a cycle‑attention adapter to maximize its semantic potential. To mitigate semantic degradation in unsupervised training, we propose caption and reflectance consistency to learn high‑level semantics and image‑level spatial semantics. Extensive experiments demonstrate that our proposed method outperforms current state‑of‑the‑art methods in traditional image quality and GEFU tasks including classification, detection, and semantic segmentation.

Abstract:
We present SurfDist, a convolutional neural network architecture for three‑dimensional volumetric instance segmentation. SurfDist enables prediction of instances represented as closed surfaces composed of smooth parametric surface patches, specifically bicubic Bézier triangles. SurfDist is a modification of the popular model architecture StarDist‑3D which breaks StarDist‑3D's coupling of instance parameterization dimension and instance voxel resolution, and it produces predictions which may be upsampled to arbitrarily high resolutions without introduction of voxelization artifacts. For datasets with blob‑shaped instances, common in biomedical imaging, SurfDist can outperform StarDist‑3D with more compact instance parameterizations. We detail SurfDist's technical implementation and show one synthetic and one real‑world dataset for which it outperforms StarDist‑3D. These results demonstrate that interpretable instance surface models can be learned effectively alongside instance membership.

Abstract:
Conservation and decision‑making regarding forest resources necessitate regular forest inventory. Light detection and ranging (LiDAR) in laser scanning systems has gained significant attention over the past two decades as a remote and non‑destructive solution to streamline the labor‑intensive and time‑consuming procedure of forest inventory. Advanced multispectral (MS) LiDAR systems simultaneously acquire three‑dimensional (3D) spatial and spectral information across multiple wavelengths of the electromagnetic spectrum. Consequently, MS‑LiDAR technology enables the estimation of both the biochemical and biophysical characteristics of forests. Forest component segmentation is crucial for forest inventory. The synergistic use of spatial and spectral laser information has proven to be beneficial for achieving precise forest semantic segmentation. Thus, this study aims to investigate the potential of MS‑LiDAR data, captured by the HeliALS system, providing high‑density multispectral point clouds to segment forests into six components: ground, low vegetation, trunks, branches, foliage, and woody debris. Three point‑wise 3D deep learning models and one machine learning model, including kernel point convolution, superpoint transformer, point transformer V3, and random forest, are implemented. Our experiments confirm the superior accuracy of the KPConv model. Additionally, various geometric and spectral feature vector scenarios are examined. The highest accuracy is achieved by feeding all three wavelengths (1550 nm, 905 nm, and 532 nm) as the initial features into the deep learning model, resulting in improvements of 33.73% and 32.35% in mean intersection over union (mIoU) and in mean accuracy (mAcc), respectively. This study highlights the excellent potential of multispectral LiDAR for improving the accuracy in fully automated forest component segmentation.
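
The two reported metrics, mean intersection over union (mIoU) and mean accuracy (mAcc), follow their standard definitions and can be computed from a confusion matrix as in the Python sketch below. The two-class matrix is a made-up example, not data from the study, and the function name is an assumption.

```python
def miou_and_macc(conf):
    """conf[i][j] = number of points with true class i predicted as class j.
    Classes that never occur are skipped; at least one class must occur."""
    n = len(conf)
    ious, accs = [], []
    for k in range(n):
        tp = conf[k][k]
        fn = sum(conf[k]) - tp                          # true k, predicted otherwise
        fp = sum(conf[i][k] for i in range(n)) - tp     # predicted k, true otherwise
        if tp + fp + fn > 0:
            ious.append(tp / (tp + fp + fn))
        if tp + fn > 0:
            accs.append(tp / (tp + fn))
    return sum(ious) / len(ious), sum(accs) / len(accs)

# Toy confusion matrix for two classes (for instance, trunk vs. foliage).
confusion = [[8, 2],
             [1, 9]]
miou, macc = miou_and_macc(confusion)
print(round(miou, 4), round(macc, 4))
```

For this matrix the per-class IoUs are 8/11 and 9/12, and the per-class accuracies are 0.8 and 0.9.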

Abstract:
Methods based on Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3D GS) have steadily gained popularity in the field of 3D object segmentation in static scenes. These approaches demonstrate efficacy in a range of 3D scene understanding and editing tasks. Nevertheless, the 4D object segmentation of dynamic scenes remains an underexplored field due to the absence of a sufficiently extensive and accurately labelled multi‑view video dataset. In this paper, we present MUVOD, a new multi‑view video dataset for training and evaluating object segmentation in reconstructed real‑world scenarios. The 17 selected scenes, describing various indoor or outdoor activities, are collected from different source datasets originating from various types of camera rigs. Each scene contains a minimum of 9 views and a maximum of 46 views. We provide 7830 RGB images (30 frames per video) with their corresponding segmentation masks in 4D motion, meaning that any object of interest in the scene could be tracked across temporal frames of a given view or across different views belonging to the same camera rig. This dataset, which contains 459 instances of 73 categories, is intended as a basic benchmark for the evaluation of multi‑view video segmentation methods. We also present an evaluation metric and a baseline segmentation approach to encourage and evaluate progress in this evolving field. Additionally, we propose a new benchmark for the 3D object segmentation task with a subset of annotated multi‑view images selected from our MUVOD dataset. This subset contains 50 objects under different conditions in different scenarios, providing a more comprehensive analysis of state‑of‑the‑art 3D object segmentation methods. Our proposed MUVOD dataset is available at https://volumetric‑repository.labs.b‑com.com/#/muvod.

Abstract:
Visual effects (VFX) production often struggles with slow, resource‑intensive mask generation. This paper presents an automated video segmentation pipeline that creates temporally consistent instance masks. It employs machine learning for: (1) flexible object detection via text prompts, (2) refined per‑frame image segmentation and (3) robust video tracking to ensure temporal stability. Deployed using containerization and leveraging a structured output format, the pipeline was quickly adopted by our artists. It significantly reduces manual effort, speeds up the creation of preliminary composites, and provides comprehensive segmentation data, thereby enhancing overall VFX production efficiency.

Authors: Johanna Orsholm, John Quinto, Hannu Autto, Gaia Banelyte, Nicolas Chazot, Jeremy deWaard, Stephanie deWaard, Arielle Farrell, Brendan Furneaux, Bess Hardwick, Nao Ito, Amlan Kar, Oula Kalttopää, Deirdre Kerdraon, Erik Kristensen, Jaclyn McKeown, Tommi Mononen, Ellen Nein, Hanna Rogers, Tomas Roslin, Paula Schmitz, Jayme Sones, Maija Sujala, Amy Thompson, Evgeny V. Zakharov, Iuliia Zarubiieva, Akshita Gupta, Scott C. Lowe, Graham W. Taylor

Abstract:
Insects comprise millions of species, many experiencing severe population declines under environmental and habitat changes. High‑throughput approaches are crucial for accelerating our understanding of insect diversity, with DNA barcoding and high‑resolution imaging showing strong potential for automatic taxonomic classification. However, most image‑based approaches rely on individual specimen data, unlike the unsorted bulk samples collected in large‑scale ecological surveys. We present the Mixed Arthropod Sample Segmentation and Identification (MassID45) dataset for training automatic classifiers of bulk insect samples. It uniquely combines molecular and imaging data at both the unsorted sample level and the full set of individual specimens. Human annotators, supported by an AI‑assisted tool, performed two tasks on bulk images: creating segmentation masks around each individual arthropod and assigning taxonomic labels to over 17 000 specimens. Combining the taxonomic resolution of DNA barcodes with precise abundance estimates of bulk images holds great potential for rapid, large‑scale characterization of insect communities. This dataset pushes the boundaries of tiny object detection and instance segmentation, fostering innovation in both ecological and machine learning research.

Abstract:
Weakly Supervised Semantic Segmentation (WSSS) is a challenging problem that has been extensively studied in recent years. Traditional approaches often rely on external modules like Class Activation Maps to highlight regions of interest and generate pseudo segmentation masks. In this work, we propose an end‑to‑end method that directly utilizes the attention maps learned by a Vision Transformer (ViT) for WSSS. We propose training a sparse ViT with multiple [CLS] tokens (one for each class), using a random masking strategy to promote [CLS] token‑to‑class assignment. At inference time, we aggregate the different self‑attention maps of each [CLS] token corresponding to the predicted labels to generate pseudo segmentation masks. Our proposed approach enhances the interpretability of self‑attention maps and ensures accurate class assignments. Extensive experiments on two standard benchmarks and three specialized datasets demonstrate that our method generates accurate pseudo‑masks, outperforming related works. These pseudo‑masks can be used to train a segmentation model that achieves results comparable to fully‑supervised models, significantly reducing the need for fine‑grained labeled data.
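
The inference step described above (aggregating the self-attention maps of the [CLS] tokens of the predicted classes into a pseudo mask) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the attention is taken to be already averaged over heads and layers, each map is min-max normalized, label 0 is reserved for background, and `bg_threshold` is a hypothetical parameter. It is not the authors' implementation.

```python
import numpy as np

def pseudo_mask_from_cls_attention(attn, predicted, grid_hw, bg_threshold=0.4):
    """Build a pseudo segmentation mask from class-token attention.

    attn:      (num_classes, num_patches) attention of each class token to patches
    predicted: class indices predicted as present in the image
    grid_hw:   (rows, cols) of the patch grid, rows * cols == num_patches
    Returns an integer map where 0 is background and k + 1 is class k.
    """
    h, w = grid_hw
    predicted = list(predicted)
    if not predicted:
        return np.zeros((h, w), dtype=np.int64)
    maps = attn[predicted].astype(float)
    # Min-max normalize each class map so that maps are comparable (assumed step).
    lo = maps.min(axis=1, keepdims=True)
    hi = maps.max(axis=1, keepdims=True)
    maps = (maps - lo) / np.maximum(hi - lo, 1e-12)
    best = maps.argmax(axis=0)              # winning predicted class per patch
    score = maps.max(axis=0)                # its normalized attention
    labels = np.asarray(predicted)[best] + 1
    labels[score < bg_threshold] = 0        # weak attention -> background
    return labels.reshape(h, w)
```

The patch-level mask would still need to be upsampled to image resolution before it can supervise a segmentation model.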

Abstract:
This paper presents StixelNExT++, a novel approach to scene representation for monocular perception systems. Building on the established Stixel representation, our method infers 3D Stixels and enhances object segmentation by clustering smaller 3D Stixel units. The approach achieves high compression of scene information while remaining adaptable to point cloud and bird's‑eye‑view representations. Our lightweight neural network, trained on automatically generated LiDAR‑based ground truth, achieves real‑time performance with computation times as low as 10 ms per frame. Experimental results on the Waymo dataset demonstrate competitive performance within a 30‑meter range, highlighting the potential of StixelNExT++ for collective perception in autonomous systems.

Abstract:
Collecting and annotating images for the purpose of training segmentation models is often cost prohibitive. In the domain of wildland fire science, this challenge is further compounded by the scarcity of reliable public datasets with labeled ground truth. This paper presents the Centralized Copy‑Paste Data Augmentation (CCPDA) method, for the purpose of assisting with the training of deep‑learning multiclass segmentation models, with special focus on improving segmentation outcomes for the fire‑class. CCPDA has three main steps: (i) identify fire clusters in the source image, (ii) apply a centralization technique to focus on the core of the fire area, and (iii) paste the refined fire clusters onto a target image. This method increases dataset diversity while preserving the essential characteristics of the fire class. The effectiveness of this augmentation technique is demonstrated via numerical analysis and comparison against various other augmentation methods using a weighted sum‑based multi‑objective optimization approach. This approach helps elevate segmentation performance metrics specific to the fire class, which carries significantly more operational significance than other classes (fuel, ash, or background). Numerical performance assessment validates the efficacy of the presented CCPDA method in alleviating the difficulties associated with small, manually labeled training datasets. It also illustrates that CCPDA outperforms other augmentation strategies in the application scenario considered, particularly in improving fire‑class segmentation performance.
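
The three CCPDA steps listed above can be sketched as follows. The abstract does not specify the centralization technique, so the version here (keeping only pixels close to the cluster centroid) is an assumed stand-in, and `FIRE`, `keep_ratio`, and all function names are hypothetical. Pasting at the same pixel coordinates is likewise a simplification.

```python
import numpy as np
from collections import deque

FIRE = 1  # hypothetical label id for the fire class

def fire_clusters(label_map):
    """Step (i): 4-connected components of fire pixels, each as a boolean mask."""
    fire = label_map == FIRE
    seen = np.zeros_like(fire, dtype=bool)
    h, w = fire.shape
    clusters = []
    for r, c in zip(*np.nonzero(fire)):
        if seen[r, c]:
            continue
        comp = np.zeros_like(fire, dtype=bool)
        queue = deque([(r, c)])
        seen[r, c] = True
        while queue:  # breadth-first flood fill
            y, x = queue.popleft()
            comp[y, x] = True
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if 0 <= ny < h and 0 <= nx < w and fire[ny, nx] and not seen[ny, nx]:
                    seen[ny, nx] = True
                    queue.append((ny, nx))
        clusters.append(comp)
    return clusters

def centralize(cluster, keep_ratio=0.6):
    """Step (ii), assumed form: keep only pixels near the cluster centroid."""
    ys, xs = np.nonzero(cluster)
    dist = np.hypot(ys - ys.mean(), xs - xs.mean())
    keep = dist <= keep_ratio * max(dist.max(), 1e-9)
    core = np.zeros_like(cluster)
    core[ys[keep], xs[keep]] = True
    return core

def paste(src_img, core, dst_img, dst_label):
    """Step (iii): copy the core pixels and their fire label onto the target."""
    out_img, out_label = dst_img.copy(), dst_label.copy()
    out_img[core] = src_img[core]
    out_label[core] = FIRE
    return out_img, out_label
```

A full augmentation pipeline would typically also randomize the paste location and blend the pasted edges; those choices are left out of this sketch.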

Abstract:
Recent advancements in robotic grasping have led to its integration as a core module in many manipulation systems. For instance, language‑driven semantic segmentation enables the grasping of any designated object or object part. However, existing methods often struggle to generate feasible grasp poses for small objects or delicate components, potentially causing the entire pipeline to fail. To address this issue, we propose a novel grasping method, FineGrasp, which introduces improvements in three key aspects. First, we introduce multiple network modifications to enhance the network's ability to handle delicate regions. Second, we address the issue of label imbalance and propose a refined graspness label normalization strategy. Third, we introduce a new simulated grasp dataset and show that mixed sim‑to‑real training further improves grasp performance. Experimental results show significant improvements, especially in grasping small objects, and confirm the effectiveness of our system in semantic grasping.

Abstract:
Video Instance Segmentation (VIS) fundamentally struggles with pervasive challenges including object occlusions, motion blur, and appearance variations during temporal association. To overcome these limitations, this work introduces geometric awareness to enhance VIS robustness by strategically leveraging monocular depth estimation. We systematically investigate three distinct integration paradigms: the Expanding Depth Channel (EDC) method concatenates the depth map as an additional input channel to segmentation networks; the Sharing ViT (SV) method uses a unified ViT backbone shared between the depth estimation and segmentation branches; and the Depth Supervision (DS) method uses depth prediction as an auxiliary training signal for feature learning. Although DS exhibits limited effectiveness, benchmark evaluations demonstrate that EDC and SV significantly enhance the robustness of VIS. With a Swin‑L backbone, our EDC method achieves 56.2 AP, setting a new state‑of‑the‑art result on the OVIS benchmark. This work conclusively establishes depth cues as critical enablers for robust video understanding.
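
The EDC idea of feeding depth as an extra input channel amounts to a channel-wise concatenation. The sketch below shows only that step; the per-frame min-max normalization of the depth map and the function name are assumptions, and the abstract does not say how the depth values are scaled.

```python
import numpy as np

def expand_depth_channel(rgb, depth):
    """Stack an estimated depth map onto an RGB frame as a fourth channel.

    rgb:   (H, W, 3) array
    depth: (H, W) array from a monocular depth estimator
    Returns a float32 (H, W, 4) array.
    """
    if rgb.shape[:2] != depth.shape:
        raise ValueError("rgb and depth must share the same height and width")
    d = depth.astype(np.float32)
    span = d.max() - d.min()
    # Assumed preprocessing: rescale depth to [0, 1] per frame.
    d = (d - d.min()) / span if span > 0 else np.zeros_like(d)
    return np.concatenate([rgb.astype(np.float32), d[..., None]], axis=-1)
```

A network consuming this input needs a first layer that accepts four channels instead of three.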

Abstract:
As critical transportation infrastructure, bridges face escalating challenges from aging and deterioration, while traditional manual inspection methods suffer from low efficiency. Although 3D point cloud technology provides a new data‑driven paradigm, its application potential is often constrained by the incompleteness of real‑world data, which results from missing labels and scanning occlusions. To overcome the bottleneck of insufficient generalization in existing synthetic data methods, this paper proposes a systematic framework for generating 3D bridge data. This framework can automatically generate complete point clouds featuring component‑level instance annotations, high‑fidelity color, and precise normal vectors. It can be further extended to simulate the creation of diverse and physically realistic incomplete point clouds, designed to support the training of segmentation and completion networks, respectively. Experiments demonstrate that a PointNet++ model trained with our synthetic data achieves a mean Intersection over Union (mIoU) of 84.2% in real‑world bridge semantic segmentation. Concurrently, a fine‑tuned KT‑Net exhibits superior performance on the component completion task. This research offers an innovative methodology and a foundational dataset for the 3D visual analysis of bridge structures, holding significant implications for advancing the automated management and maintenance of infrastructure.

Abstract:
Panoptic Scene Graph Generation (PSG) integrates instance segmentation with relation understanding to capture pixel‑level structural relationships in complex scenes. Although recent approaches leveraging pre‑trained vision‑language models (VLMs) have significantly improved performance in the open‑vocabulary setting, they commonly ignore the inherent limitations of VLMs in spatial relation reasoning, such as difficulty in distinguishing the relative positions of objects, which results in suboptimal relation prediction. Motivated by the ability of the denoising diffusion model's inversion process to preserve the spatial structure of input images, we propose the SPADE (SPatial‑Aware Denoising‑nEtwork) framework, a novel approach for open‑vocabulary PSG. SPADE consists of two key steps: (1) inversion‑guided calibration for the UNet adaptation, and (2) spatial‑aware context reasoning. In the first step, we calibrate a general pre‑trained teacher diffusion model into a PSG‑specific denoising network with cross‑attention maps derived during inversion through a lightweight LoRA‑based fine‑tuning strategy. In the second step, we develop a spatial‑aware relation graph transformer that captures both local and long‑range contextual information, facilitating the generation of high‑quality relation queries. Extensive experiments on benchmark PSG and Visual Genome datasets demonstrate that SPADE outperforms state‑of‑the‑art methods in both closed‑ and open‑set scenarios, particularly for spatial relationship prediction.

Abstract:
Partial‑view 3D recognition ‑‑ reconstructing 3D geometry and identifying object instances from a few sparse RGB images ‑‑ is an exceptionally challenging yet practically essential task, particularly in cluttered, occluded real‑world settings where full‑view or reliable depth data are often unavailable. Existing methods, whether based on strong symmetry priors or supervised learning on curated datasets, fail to generalize to such scenarios. In this work, we introduce DreamGrasp, a framework that leverages the imagination capability of large‑scale pre‑trained image generative models to infer the unobserved parts of a scene. By combining coarse 3D reconstruction, instance segmentation via contrastive learning, and text‑guided instance‑wise refinement, DreamGrasp circumvents limitations of prior methods and enables robust 3D reconstruction in complex, multi‑object environments. Our experiments show that DreamGrasp not only recovers accurate object geometry but also supports downstream tasks like sequential decluttering and target retrieval with high success rates.

Abstract:
Rheumatoid arthritis (RA) is a common autoimmune disease that has been the focus of research in computer‑aided diagnosis (CAD) and disease monitoring. In clinical settings, conventional radiography (CR) is widely used for the screening and evaluation of RA due to its low cost and accessibility. The wrist is a critical region for the diagnosis of RA. However, CAD research in this area remains limited, primarily due to the challenges in acquiring high‑quality instance‑level annotations. (i) The wrist comprises numerous small bones with narrow joint spaces, complex structures, and frequent overlaps, requiring detailed anatomical knowledge for accurate annotation. (ii) Disease progression in RA often leads to osteophytes, bone erosion (BE), and even bony ankylosis, which alter bone morphology and increase annotation difficulty, necessitating expertise in rheumatology. This work presents a multi‑task dataset for wrist bones in CR, covering two tasks: (i) wrist bone instance segmentation and (ii) Sharp/van der Heijde (SvdH) BE scoring. It is the first public resource for wrist bone instance segmentation. The dataset comprises 1048 conventional wrist radiographs of 388 patients from six medical centers, with pixel‑level instance segmentation annotations for 618 images and SvdH BE scores for 800 images. This dataset can potentially support a wide range of research tasks related to RA, including joint space narrowing (JSN) progression quantification, BE detection, bone deformity evaluation, and osteophyte detection. It may also be applied to other wrist‑related tasks, such as carpal bone fracture localization. We hope this dataset will significantly lower the barrier to research on wrist RA and accelerate progress in CAD research within the RA‑related domain.

Abstract:
We present MOSU, a novel autonomous long‑range navigation system that enhances global navigation for mobile robots through multimodal perception and on‑road scene understanding. MOSU addresses the outdoor robot navigation challenge by integrating geometric, semantic, and contextual information to ensure comprehensive scene understanding. The system combines GPS and QGIS map‑based routing for high‑level global path planning with multi‑modal trajectory generation for local navigation refinement. For trajectory generation, MOSU leverages multiple modalities: LiDAR‑based geometric data for precise obstacle avoidance, image‑based semantic segmentation for traversability assessment, and Vision‑Language Models (VLMs) to capture social context and enable the robot to adhere to social norms in complex environments. This multi‑modal integration improves scene understanding and enhances traversability, allowing the robot to adapt to diverse outdoor conditions. We evaluate our system in real‑world on‑road environments and benchmark it on the GND dataset, achieving a 10% improvement in traversability on navigable terrains while maintaining a comparable navigation distance to existing global navigation methods.

Abstract:
In recent years, cities have increasingly reduced speed limits from 50 km/h to 30 km/h to enhance road safety, reduce noise pollution, and promote sustainable modes of transportation. However, achieving compliance with these new limits remains a key challenge for urban planners. This study investigates drivers' compliance with the 30 km/h speed limit in Milan and examines how street characteristics influence driving behavior. Our findings suggest that the mere introduction of lower speed limits is not sufficient to reduce driving speeds effectively, highlighting the need to understand how street design can improve speed limit adherence. To comprehend this relationship, we apply computer vision‑based semantic segmentation models to Google Street View images. A large‑scale analysis reveals that narrower streets and densely built environments are associated with lower speeds, whereas roads with greater visibility and larger sky views encourage faster driving. To evaluate the influence of the local context on speeding behaviour, we apply the developed methodological framework to two additional cities: Amsterdam, which, similar to Milan, is a historic European city not originally developed for cars, and Dubai, which instead has developed in recent decades with a more car‑centric design. The results of the analyses largely confirm the findings obtained in Milan, which demonstrates the broad applicability of the road design guidelines for driver speed compliance identified in this paper. Finally, we develop a machine learning model to predict driving speeds based on street characteristics. We showcase the model's predictive power by estimating the compliance with speed limits in Milan if the city were to adopt a 30 km/h speed limit city‑wide. The tool provides actionable insights for urban planners, supporting the design of interventions to improve speed limit compliance.
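
Street-level descriptors such as the share of sky or buildings in view are typically derived from a segmentation map by counting pixels per class. The sketch below shows that step under assumed conventions: the class list and its integer ids are hypothetical, and the abstract does not state which features or label set the study actually uses.

```python
import numpy as np

# Hypothetical label set; index in the list = integer id in the segmentation map.
CLASS_NAMES = ["road", "building", "sky", "vegetation"]

def class_fractions(seg, class_names=CLASS_NAMES):
    """Share of image pixels assigned to each class in a segmentation map."""
    counts = np.bincount(seg.ravel(), minlength=len(class_names))
    return {name: float(counts[i]) / seg.size for i, name in enumerate(class_names)}

# Toy 2 x 4 "image": the top row is sky, the bottom row is buildings and road.
seg = np.array([[2, 2, 2, 2],
                [1, 1, 0, 0]])
features = class_fractions(seg)
```

Fractions like these, computed per street-view image, could then serve as inputs to a model that predicts driving speed.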

Abstract:
Understanding surgical scenes can provide better healthcare quality for patients, especially with the vast amount of video data that is generated during minimally invasive surgery (MIS). Processing these videos generates valuable assets for training sophisticated models. In this paper, we introduce CLIP‑RL, a novel contrastive language‑image pre‑training model tailored for semantic segmentation of surgical scenes. CLIP‑RL presents a new segmentation approach that involves reinforcement learning (RL) and curriculum learning, enabling continuous refinement of the segmentation masks during the full training pipeline. Our model has shown robust performance under challenging optical conditions such as occlusions, texture variations, and dynamic lighting. The CLIP model serves as a powerful feature extractor, capturing rich semantic context that enhances the distinction between instruments and tissues. The RL module plays a pivotal role in dynamically refining predictions through iterative action‑space adjustments. We evaluated CLIP‑RL on the EndoVis 2018 and EndoVis 2017 datasets. CLIP‑RL achieved a mean IoU of 81% on EndoVis 2018, outperforming state‑of‑the‑art models, and a mean IoU of 74.12% on EndoVis 2017. This superior performance is attributed to the combination of contrastive learning with reinforcement learning and curriculum learning.

Abstract:
Holistic surgical scene segmentation in robot‑assisted surgery (RAS) enables surgical residents to identify various anatomical tissues, articulated tools, and critical structures, such as veins and vessels. Given the firm intraoperative time constraints, it is challenging for surgeons to provide detailed real‑time explanations of the operative field for trainees. This challenge is compounded by the scarcity of expert surgeons relative to trainees, making the unambiguous delineation of go‑ and no‑go zones inconvenient. Therefore, high‑performance semantic segmentation models offer a solution by providing clear postoperative analyses of surgical procedures. However, recent advanced segmentation models rely on user‑generated prompts, rendering them impractical for lengthy surgical videos that commonly exceed an hour. To address this challenge, we introduce Surg‑SegFormer, a novel prompt‑free model that outperforms current state‑of‑the‑art techniques. Surg‑SegFormer attained a mean Intersection over Union (mIoU) of 0.80 on the EndoVis2018 dataset and 0.54 on the EndoVis2017 dataset. By providing robust and automated surgical scene comprehension, this model significantly reduces the tutoring burden on expert surgeons, empowering residents to independently and effectively understand complex surgical environments.

Abstract:
Ray tracing is a widely used deterministic method for radio propagation simulations, capable of producing physically accurate multipath components. The accuracy depends on the quality of the environment model and its electromagnetic properties. Recent advances in computer vision and machine learning have made it possible to reconstruct detailed environment models augmented with semantic segmentation labels. In this letter, we propose a differentiable ray tracing‑based radio propagation simulator that operates directly on point clouds. We showcase the efficiency of our method by simulating multi‑bounce propagation paths with up to five interactions with specular reflections and diffuse scattering in two indoor scenarios, each completing in less than 90 ms. Lastly, we demonstrate how the differentiability of electromagnetic computations can be combined with segmentation labels to learn the electromagnetic properties of the environment.

Abstract:
Effective Out‑of‑Distribution (OOD) detection is critical for ensuring the reliability of semantic segmentation models, particularly in complex road environments where safety and accuracy are paramount. Despite recent advancements in large language models (LLMs), notably GPT‑4, which significantly enhanced multimodal reasoning through Chain‑of‑Thought (CoT) prompting, the application of CoT‑based visual reasoning for OOD semantic segmentation remains largely unexplored. In this paper, through extensive analyses of the road scene anomalies, we identify three challenging scenarios where current state‑of‑the‑art OOD segmentation methods consistently struggle: (1) densely packed and overlapping objects, (2) distant scenes with small objects, and (3) large foreground‑dominant objects. To address the presented challenges, we propose a novel CoT‑based framework targeting OOD detection in road anomaly scenes. Our method leverages the extensive knowledge and reasoning capabilities of foundation models, such as GPT‑4, to enhance OOD detection through improved image understanding and prompt‑based reasoning aligned with observed problematic scene attributes. Extensive experiments show that our framework consistently outperforms state‑of‑the‑art methods on both standard benchmarks and our newly defined challenging subset of the RoadAnomaly dataset, offering a robust and interpretable solution for OOD semantic segmentation in complex driving environments.

Abstract:
Event cameras have recently been introduced into image semantic segmentation, owing to their high temporal resolution and other advantageous properties. However, existing event‑based semantic segmentation methods often fail to fully exploit the complementary information provided by frames and events, resulting in complex training strategies and increased computational costs. To address these challenges, we propose an efficient hybrid framework for image semantic segmentation, comprising a Spiking Neural Network branch for events and an Artificial Neural Network branch for frames. Specifically, we introduce three specialized modules to facilitate the interaction between these two branches: the Adaptive Temporal Weighting (ATW) Injector, the Event‑Driven Sparse (EDS) Injector, and the Channel Selection Fusion (CSF) module. The ATW Injector dynamically integrates temporal features from event data into frame features, enhancing segmentation accuracy by leveraging critical dynamic temporal information. The EDS Injector effectively combines sparse event data with rich frame features, ensuring precise temporal and spatial information alignment. The CSF module selectively merges these features to optimize segmentation performance. Experimental results demonstrate that our framework not only achieves state‑of‑the‑art accuracy across the DDD17‑Seg, DSEC‑Semantic, and M3ED‑Semantic datasets but also significantly reduces energy consumption, achieving a 65% reduction on the DSEC‑Semantic dataset.

Abstract:
Awareness of moving objects in the surroundings of a self‑driving vehicle is essential for safe and reliable autonomous navigation. The interpretation of LiDAR and camera data achieves exceptional results but typically requires accumulating and processing temporal sequences of data in order to extract motion information. In contrast, radar sensors, which are already installed in most recent vehicles, can overcome this limitation as they directly provide the Doppler velocity of the detections and hence incorporate instantaneous motion information within a single measurement. In this paper, we tackle the problem of moving object segmentation in noisy radar point clouds. We also consider differentiating parked from moving cars to enhance scene understanding. Instead of exploiting temporal dependencies to identify moving objects, we develop a novel transformer‑based approach to accurately perform single‑scan moving object segmentation in sparse radar scans. The key to our Radar Velocity Transformer is to incorporate the valuable velocity information throughout each module of the network, thereby enabling the precise segmentation of moving and non‑moving objects. Additionally, we propose a transformer‑based upsampling, which enhances the performance by adaptively combining information and overcoming the limitation of interpolation of sparse point clouds. Finally, we create a new radar moving object segmentation benchmark based on the RadarScenes dataset and compare our approach to other state‑of‑the‑art methods. Our network runs faster than the frame rate of the sensor and shows superior segmentation results using only single‑scan radar data.

Abstract:
Event cameras offer significant potential for Low‑light Image Enhancement (LLIE), yet existing fusion approaches are constrained by a fundamental dilemma: early fusion struggles with modality heterogeneity, while late fusion severs crucial feature correlations. To address these limitations, we propose EvRWKV, a novel framework that enables continuous cross‑modal interaction through dual‑domain processing. It mainly comprises a Cross‑RWKV Module to capture fine‑grained temporal and cross‑modal dependencies, and an Event Image Spectral Fusion Enhancer (EISFE) module to perform joint adaptive frequency‑domain denoising and spatial‑domain alignment. This continuous interaction maintains feature consistency from low‑level textures to high‑level semantics. Extensive experiments on the real‑world SDE and SDSD datasets demonstrate that EvRWKV significantly outperforms image‑only methods by 1.79 dB and 1.85 dB in PSNR, respectively. To further validate the practical utility of our method for downstream applications, we evaluated its impact on semantic segmentation. Experiments demonstrate that images enhanced by EvRWKV lead to a significant 35.44% improvement in mIoU.
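
PSNR, the metric behind the dB gains quoted above, is defined as 10 * log10(MAX^2 / MSE), where MAX is the largest possible pixel value and MSE is the mean squared error between the reference and the enhanced image. A minimal sketch follows; the function name and the default `max_value` of 255 (8-bit images) are assumptions.

```python
import numpy as np

def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    ref = np.asarray(reference, dtype=np.float64)
    est = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((ref - est) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))
```

Because the scale is logarithmic, a gain of about 3 dB corresponds to roughly halving the mean squared error.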

Abstract:
The performance of image segmentation models has historically been constrained by the high cost of collecting large‑scale annotated data. The Segment Anything Model (SAM) alleviates this original problem through a promptable, semantics‑agnostic, segmentation paradigm and yet still requires manual visual‑prompts or complex domain‑dependent prompt‑generation rules to process a new image. Towards reducing this new burden, our work investigates the task of object segmentation when provided with, alternatively, only a small set of reference images. Our key insight is to leverage strong semantic priors, as learned by foundation models, to identify corresponding regions between a reference and a target image. We find that correspondences enable automatic generation of instance‑level segmentation masks for downstream tasks and instantiate our ideas via a multi‑stage, training‑free method incorporating (1) memory bank construction; (2) representation aggregation and (3) semantic‑aware feature matching. Our experiments show significant improvements on segmentation metrics, leading to state‑of‑the‑art performance on COCO FSOD (36.8% nAP), PASCAL VOC Few‑Shot (71.2% nAP50) and outperforming existing training‑free approaches on the Cross‑Domain FSOD benchmark (22.4% nAP).
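
The three stages named above (memory bank construction, representation aggregation, and feature matching) can be illustrated with a toy sketch on patch features. This is not the authors' method: the mean-prototype aggregation, the cosine-similarity matching, the `min_sim` threshold, and all function names are assumptions chosen to keep the example small, and the features would in practice come from a pretrained foundation model.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def build_memory_bank(ref_feats, ref_labels):
    """(1) Collect reference patch features per class.

    ref_feats:  list of (N_i, D) arrays, one per reference image
    ref_labels: list of (N_i,) integer arrays, -1 marks patches to ignore
    """
    bank = {}
    for feats, labels in zip(ref_feats, ref_labels):
        for cls in np.unique(labels):
            if cls < 0:
                continue
            bank.setdefault(int(cls), []).append(feats[labels == cls])
    return {c: np.concatenate(v) for c, v in bank.items()}

def aggregate(bank):
    """(2) Reduce each class to one L2-normalized mean prototype (assumed)."""
    classes = sorted(bank)
    protos = np.stack([bank[c].mean(axis=0) for c in classes])
    return classes, l2_normalize(protos)

def match(target_feats, classes, protos, min_sim=0.5):
    """(3) Label each target patch by its most similar prototype, or -1."""
    sim = l2_normalize(target_feats) @ protos.T   # cosine similarities
    labels = np.asarray(classes)[sim.argmax(axis=1)]
    labels[sim.max(axis=1) < min_sim] = -1        # no confident match
    return labels
```

The per-patch labels returned by `match` could then be grouped into regions and passed as prompts to a promptable segmenter, which is one plausible way to obtain instance masks without manual prompts.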

Abstract:
In the aftermath of earthquakes, social media images have become a crucial resource for disaster reconnaissance, providing immediate insights into the extent of damage. Traditional approaches to damage severity assessment in post‑earthquake social media images often rely on classification methods, which are inherently subjective and incapable of accounting for the varying extents of damage within an image. Addressing these limitations, this study proposes a novel approach by framing damage severity assessment as a semantic segmentation problem, aiming for a more objective analysis of damage in earthquake‑affected areas. The methodology involves the construction of a segmented damage severity dataset, categorizing damage into three degrees: undamaged structures, damaged structures, and debris. Utilizing this dataset, the study fine‑tunes a SegFormer model to generate damage severity segmentations for post‑earthquake social media images. Furthermore, a new damage severity scoring system is introduced, quantifying damage by considering the varying degrees of damage across different areas within images, adjusted for depth estimation. The application of this approach allows for the quantification of damage severity in social media images in a more objective and comprehensive manner. By providing a nuanced understanding of damage, this study enhances the ability to offer precise guidance to disaster reconnaissance teams, facilitating more effective and targeted response efforts in the aftermath of earthquakes.

Abstract:
Multiple instance learning (MIL) has significantly reduced annotation costs via bag‑level weak labels for large‑scale images, such as histopathological whole slide images (WSIs). However, its adaptability to continual tasks with minimal forgetting has rarely been explored, especially for instance classification for localization. Weakly incremental learning for semantic segmentation has been studied for continual localization, but it focused on natural images, leveraging global relationships among hundreds of small patches (e.g., 16 × 16) using pre‑trained models. This approach seems infeasible for MIL localization due to the enormous number (~10^5) of large patches (e.g., 256 × 256) and no available global relationships such as cancer cells. To address these challenges, we propose Continual Multiple Instance Learning with Enhanced Localization (CoMEL), an MIL framework for both localization and adaptability with minimal forgetting. CoMEL consists of (1) Grouped Double Attention Transformer (GDAT) for efficient instance encoding, (2) Bag Prototypes‑based Pseudo‑Labeling (BPPL) for reliable instance pseudo‑labeling, and (3) Orthogonal Weighted Low‑Rank Adaptation (OWLoRA) to mitigate forgetting in both bag and instance classification. Extensive experiments on three public WSI datasets demonstrate the superior performance of CoMEL, outperforming prior art by up to 11.00% in bag‑level accuracy and up to 23.4% in localization accuracy under the continual MIL setup.

Abstract:
Recent advances in brain‑vision decoding have driven significant progress, reconstructing perceived visual stimuli with high fidelity from neural activity in the human visual cortex, e.g., as measured by functional magnetic resonance imaging (fMRI). Most existing methods decode the brain signal using a two‑level strategy, i.e., pixel‑level and semantic‑level. However, these methods rely heavily on low‑level pixel alignment yet lack sufficient and fine‑grained semantic alignment, resulting in obvious reconstruction distortions of multiple semantic objects. To better understand the brain's visual perception patterns and how current decoding models process semantic objects, we have developed an experimental framework that uses fMRI representations as intervention conditions. By injecting these representations into multi‑scale image features via cross‑attention, we compare both downstream performance and intermediate feature changes on object detection and instance segmentation tasks with and without fMRI information. Our results demonstrate that incorporating fMRI signals enhances the accuracy of downstream detection and segmentation, confirming that fMRI contains rich multi‑object semantic cues and coarse spatial localization information, elements that current models have yet to fully exploit or integrate.

Abstract:
Multimodal foundation models (MFMs), such as GPT‑4o, have recently made remarkable progress. However, their detailed visual understanding beyond question answering remains unclear. In this paper, we benchmark popular MFMs (GPT‑4o, o4‑mini, Gemini 1.5 Pro and Gemini 2.0 Flash, Claude 3.5 Sonnet, Qwen2‑VL, Llama 3.2) on standard computer vision tasks (semantic segmentation, object detection, image classification, depth and surface normal prediction) using established datasets (e.g., COCO, ImageNet, etc). The main challenges in performing this analysis are: 1) most models are trained to output text and cannot natively express versatile domains, such as segments or 3D geometry, and 2) many leading models are proprietary and accessible only at an API level, i.e., there is no weight access to adapt them. We address these by translating vision tasks into text‑promptable, API‑compatible formats via prompt chaining, creating a standardized benchmarking framework. We observe that: 1) The MFMs are not close to the state‑of‑the‑art specialist models at any task. 2) They are respectable generalists; this is remarkable, as they are presumably trained on image‑text‑based tasks. 3) They perform semantic tasks notably better than geometric ones. 4) GPT‑4o performs the best among non‑reasoning models, securing the top position in 4 out of 6 tasks. 5) Reasoning models, e.g., o3, show improvements in geometric tasks. 6) While prompt chaining techniques affect performance, better models are less sensitive to prompt variations. 7) An analysis of models with native image generation, such as the latest GPT‑4o, shows they exhibit failure modes, such as hallucinated objects or misalignment between input and output.

Abstract:
In orchard automation, dense foliage during the canopy season severely occludes tree structures, reducing the visibility of canopy parts such as trunks and branches, which limits the ability of a machine vision system. However, canopy structure is more open and visible during the dormant season when trees are defoliated. In this work, we present an information fusion framework that integrates multi‑seasonal structural data to support robotic and automated crop load management during the entire growing season. The framework combines high‑resolution RGB‑D imagery from both dormant and canopy periods using YOLOv9‑Seg for instance segmentation, Kinect Fusion for 3D reconstruction, and Fast Generalized Iterative Closest Point (Fast GICP) for model alignment. Segmentation outputs from YOLOv9‑Seg were used to extract depth‑informed masks, which enabled accurate 3D point cloud reconstruction via Kinect Fusion; these reconstructed models from each season were subsequently aligned using Fast GICP to achieve spatially coherent multi‑season fusion. The YOLOv9‑Seg model, trained on manually annotated images, achieved a mean squared error (MSE) of 0.0047 and segmentation mAP@50 scores up to 0.78 for trunks in the dormant season dataset. Kinect Fusion enabled accurate reconstruction of tree geometry, validated with field measurements resulting in root mean square errors (RMSE) of 5.23 mm for trunk diameter, 4.50 mm for branch diameter, and 13.72 mm for branch spacing. Fast GICP achieved precise cross‑seasonal registration with a minimum fitness score of 0.00197, allowing integrated, comprehensive tree structure modeling despite heavy occlusions during the growing season. This fused structural representation enables robotic systems to access otherwise obscured architectural information, improving the precision of pruning, thinning, and other automated orchard operations.

Abstract:
Remote sensing semantic segmentation must address both what the ground objects are within an image and where they are located. Consequently, segmentation models must ensure not only the semantic correctness of large‑scale patches (low‑frequency information) but also the precise localization of boundaries between patches (high‑frequency information). However, most existing approaches rely heavily on discriminative learning, which excels at capturing low‑frequency features, while overlooking its inherent limitations in learning high‑frequency features for semantic segmentation. Recent studies have revealed that diffusion generative models excel at generating high‑frequency details. Our theoretical analysis confirms that the diffusion denoising process significantly enhances the model's ability to learn high‑frequency features; however, we also observe that these models exhibit insufficient semantic inference for low‑frequency features when guided solely by the original image. Therefore, we integrate the strengths of both discriminative and generative learning, proposing the Integration of Discriminative and diffusion‑based Generative learning for Boundary Refinement (IDGBR) framework. The framework first generates a coarse segmentation map using a discriminative backbone model. This map and the original image are fed into a conditioning guidance network to jointly learn a guidance representation subsequently leveraged by an iterative denoising diffusion process refining the coarse segmentation. Extensive experiments across five remote sensing semantic segmentation datasets (binary and multi‑class segmentation) confirm our framework's capability of consistent boundary refinement for coarse results from diverse discriminative architectures.

Abstract:
Instance segmentation of novel object instances in RGB images, given some example images for each object, is a well‑known problem in computer vision. Designing a model general enough to be employed for all kinds of novel objects without (re‑)training has proven to be a difficult task. To handle this, we present a new training‑free framework called Novel Object Cyclic Threshold based Instance Segmentation (NOCTIS). NOCTIS integrates two pre‑trained models: Grounded‑SAM 2 for object proposals with precise bounding boxes and corresponding segmentation masks; and DINOv2 for robust class and patch embeddings, due to its zero‑shot capabilities. Internally, the proposal‑object matching is realized by determining an object matching score based on the similarity of the class embeddings and the average maximum similarity of the patch embeddings with a new cyclic thresholding (CT) mechanism that mitigates unstable matches caused by repetitive textures or visually similar patterns. Beyond CT, NOCTIS introduces: (i) an appearance score that is unaffected by object selection bias; (ii) the usage of the average confidence of the proposals' bounding box and mask as a scoring component; and (iii) an RGB‑only pipeline that performs even better than RGB‑D ones. We empirically show that NOCTIS, without further training or fine‑tuning, outperforms the best RGB and RGB‑D methods regarding the mean AP score on the seven core datasets of the BOP 2023 challenge for the "Model‑based 2D segmentation of unseen objects" task.
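
The two similarity terms named in this abstract can be illustrated with a minimal Python sketch. It assumes cosine similarity and an equal‑weight combination; the function names and the weight `w_cls` are hypothetical, and the cyclic thresholding, appearance score, and proposal‑confidence terms are deliberately left out because the abstract does not specify them.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def avg_max_patch_similarity(proposal_patches, template_patches):
    """For each proposal patch, keep its best template match, then average."""
    best = [max(cosine(p, t) for t in template_patches) for p in proposal_patches]
    return sum(best) / len(best)


def matching_score(prop_cls, tmpl_cls, prop_patches, tmpl_patches, w_cls=0.5):
    """Hypothetical combination of both terms; the weighting is an assumption."""
    s_cls = cosine(prop_cls, tmpl_cls)
    s_patch = avg_max_patch_similarity(prop_patches, tmpl_patches)
    return w_cls * s_cls + (1.0 - w_cls) * s_patch


# Toy embeddings: identical class embeddings, one of two patches matches.
score = matching_score([1.0, 0.0], [1.0, 0.0],
                       [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]])
```

In this toy case the class term is 1.0 and the patch term is 0.5, so the combined score is 0.75.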

Abstract:
Synthesizing realistic microstructure images conditioned on processing parameters is crucial for understanding process‑structure relationships in materials design. However, this task remains challenging due to limited training micrographs and the continuous nature of processing variables. To overcome these challenges, we present a novel process‑aware generative modeling approach based on Stable Diffusion 3.5 Large (SD3.5‑Large), a state‑of‑the‑art text‑to‑image diffusion model adapted for microstructure generation. Our method introduces numeric‑aware embeddings that encode continuous variables (annealing temperature, time, and magnification) directly into the model's conditioning, enabling controlled image generation under specified process conditions and capturing process‑driven microstructural variations. To address data scarcity and computational constraints, we fine‑tune only a small fraction of the model's weights via DreamBooth and Low‑Rank Adaptation (LoRA), efficiently transferring the pre‑trained model to the materials domain. We validate realism using a semantic segmentation model based on a fine‑tuned U‑Net with a VGG16 encoder on 24 labeled micrographs. It achieves 97.1% accuracy and 85.7% mean IoU, outperforming previous methods. Quantitative analyses using physical descriptors and spatial statistics show strong agreement between synthetic and real microstructures. Specifically, two‑point correlation and lineal‑path errors remain below 2.1% and 0.6%, respectively. Our method represents the first adaptation of SD3.5‑Large for process‑aware microstructure generation, offering a scalable approach for data‑driven materials design.

Abstract:
Organ segmentation of plant point clouds is a prerequisite for the high‑resolution and accurate extraction of organ‑level phenotypic traits. Although the fast development of deep learning has boosted much research on segmentation of plant point clouds, the existing techniques for organ segmentation still face limitations in resolution, segmentation accuracy, and generalizability across various plant species. In this study, we proposed a novel approach called plant segmentation neural radiance fields (PlantSegNeRF), aiming to directly generate high‑precision instance point clouds from multi‑view RGB image sequences for a wide range of plant species. PlantSegNeRF performed 2D instance segmentation on the multi‑view images to generate instance masks for each organ with a corresponding ID. The multi‑view instance IDs corresponding to the same plant organ were then matched and refined using a specially designed instance matching module. The instance NeRF was developed to render an implicit scene, containing color, density, semantic and instance information. The implicit scene was ultimately converted into high‑precision plant instance point clouds based on the volume density. The results proved that in semantic segmentation of point clouds, PlantSegNeRF outperformed the commonly used methods, demonstrating an average improvement of 16.1%, 18.3%, 17.8%, and 24.2% in precision, recall, F1‑score, and IoU compared to the second‑best results on structurally complex species. More importantly, PlantSegNeRF exhibited significant advantages in plant point cloud instance segmentation tasks. Across all plant species, it achieved average improvements of 11.7%, 38.2%, 32.2% and 25.3% in mPrec, mRec, mCov, mWCov, respectively. This study extends the organ‑level plant phenotyping and provides a high‑throughput way to supply high‑quality 3D data for the development of large‑scale models in plant science.

Abstract:
High‑resolution imaging is crucial for enhancing visual clarity and enabling precise computer‑assisted guidance in minimally invasive surgery (MIS). Despite the increasing adoption of 4K endoscopic systems, there remains a significant gap in publicly available native 4K datasets tailored specifically for robotic‑assisted MIS. We introduce SurgiSR4K, the first publicly accessible surgical imaging and video dataset captured at a native 4K resolution, representing realistic conditions of robotic‑assisted procedures. SurgiSR4K comprises diverse visual scenarios including specular reflections, tool occlusions, bleeding, and soft tissue deformations, meticulously designed to reflect common challenges faced during laparoscopic and robotic surgeries. This dataset opens up possibilities for a broad range of computer vision tasks that might benefit from high resolution data, such as super resolution (SR), smoke removal, surgical instrument detection, 3D tissue reconstruction, monocular depth estimation, instance segmentation, novel view synthesis, and vision‑language model (VLM) development. SurgiSR4K provides a robust foundation for advancing research in high‑resolution surgical imaging and fosters the development of intelligent imaging technologies aimed at enhancing performance, safety, and usability in image‑guided robotic surgeries.

Abstract:
The performance of learning‑based perception algorithms suffers when deployed in out‑of‑distribution and underrepresented environments. Outdoor robots are particularly susceptible to rapid changes in visual scene appearance due to dynamic lighting, seasonality and weather effects that lead to scenes underrepresented in the training data of the learning‑based perception system. In this conceptual paper, we focus on preparing our autonomous vehicle for deployment in snow‑filled environments. We propose a novel method for diffusion‑based image augmentation to more closely represent the deployment environment in our training data. Diffusion‑based image augmentations rely on the public availability of vision foundation models learned on internet‑scale datasets. The diffusion‑based image augmentations allow us to take control over the semantic distribution of the ground surfaces in the training data and to fine‑tune our model for its deployment environment. We employ open vocabulary semantic segmentation models to filter out augmentation candidates that contain hallucinations. We believe that diffusion‑based image augmentations can be extended to many other environments apart from snow surfaces, like sandy environments and volcanic terrains.

Abstract:
Accurate and generalizable object segmentation in ultrasound imaging remains a significant challenge due to anatomical variability, diverse imaging protocols, and limited annotated data. In this study, we propose a prompt‑driven vision‑language model (VLM) that integrates Grounding DINO with SAM2 (Segment Anything Model 2) to enable object segmentation across multiple ultrasound organs. A total of 18 public ultrasound datasets, encompassing the breast, thyroid, liver, prostate, kidney, and paraspinal muscle, were utilized. Of these, 15 were used for fine‑tuning and validating Grounding DINO on the ultrasound domain with Low Rank Adaptation (LoRA), and 3 were held out entirely for testing to evaluate performance in unseen distributions. Comprehensive experiments demonstrate that our approach outperforms state‑of‑the‑art segmentation methods, including UniverSeg, MedSAM, MedCLIP‑SAM, BiomedParse, and SAMUS on most seen datasets while maintaining strong performance on unseen datasets without additional fine‑tuning. These results underscore the promise of VLMs in scalable and robust ultrasound image analysis, reducing dependence on large, organ‑specific annotated datasets. We will publish our code on code.sonography.ai after acceptance.

Abstract:
Ultrasound (US) is widely accessible and radiation‑free but has a steep learning curve due to its dynamic nature and non‑standard imaging planes. Additionally, the constant need to shift focus between the US screen and the patient poses a challenge. To address these issues, we integrate deep learning (DL)‑based semantic segmentation for real‑time (RT) automated kidney volumetric measurements, which are essential for clinical assessment but are traditionally time‑consuming and prone to fatigue. This automation allows clinicians to concentrate on image interpretation rather than manual measurements. Complementing DL, augmented reality (AR) enhances the usability of US by projecting the display directly into the clinician's field of view, improving ergonomics and reducing the cognitive load associated with screen‑to‑patient transitions. Two AR‑DL‑assisted US pipelines on HoloLens‑2 are proposed: one streams directly via the application programming interface for a wireless setup, while the other supports any US device with video output for broader accessibility. We evaluate RT feasibility and accuracy using the Open Kidney Dataset and open‑source segmentation models (nnU‑Net, Segmenter, YOLO with MedSAM and LiteMedSAM). Our open‑source GitHub pipeline includes model implementations, measurement algorithms, and a Wi‑Fi‑based streaming solution, enhancing US training and diagnostics, especially in point‑of‑care settings.

Abstract:
Existing open‑vocabulary 3D semantic segmentation methods typically supervise 3D segmentation models by merging text‑aligned features (e.g., CLIP) extracted from multi‑view images onto 3D points. However, such approaches treat multi‑view images merely as intermediaries for transferring open‑vocabulary information, overlooking their rich semantic content and cross‑view correspondences, which limits model effectiveness. To address this, we propose PGOV3D, a novel framework that introduces a Partial‑to‑Global curriculum for improving open‑vocabulary 3D semantic segmentation. The key innovation lies in a two‑stage training strategy. In the first stage, we pre‑train the model on partial scenes that provide dense semantic information but relatively simple geometry. These partial point clouds are derived from multi‑view RGB‑D inputs via pixel‑wise depth projection. To enable open‑vocabulary learning, we leverage a multi‑modal large language model (MLLM) and a 2D segmentation foundation model to generate open‑vocabulary labels for each viewpoint, offering rich and aligned supervision. An auxiliary inter‑frame consistency module is introduced to enforce feature consistency across varying viewpoints and enhance spatial understanding. In the second stage, we fine‑tune the model on complete scene‑level point clouds, which are sparser and structurally more complex. We aggregate the partial vocabularies associated with each scene and generate pseudo labels using the pre‑trained model, effectively bridging the semantic gap between dense partial observations and large‑scale 3D environments. Extensive experiments on ScanNet, ScanNet200, and S3DIS benchmarks demonstrate that PGOV3D achieves competitive performance in open‑vocabulary 3D semantic segmentation.

Abstract:
The rapid advancement of AI and computer vision has significantly increased the demand for high‑quality annotated datasets, particularly for semantic segmentation. However, creating such datasets is resource‑intensive, requiring substantial time, labor, and financial investment, and often raises privacy concerns due to the use of real‑world data. To mitigate these challenges, we present SynthLab, consisting of a modular platform for visual data synthesis and a user‑friendly interface. The modular architecture of SynthLab enables easy maintenance, scalability with centralized updates, and seamless integration of new features. Each module handles distinct aspects of computer vision tasks, enhancing flexibility and adaptability. Meanwhile, its interactive, user‑friendly interface allows users to quickly customize their data pipelines through drag‑and‑drop actions. Extensive user studies involving a diverse range of users across different ages, professions, and expertise levels have demonstrated the flexible usage and high accessibility of SynthLab, enabling users without deep technical expertise to harness AI for real‑world applications.

Abstract:
Weakly supervised semantic segmentation (WSSS) methods using class labels often rely on class activation maps (CAMs) to localize objects. However, traditional CAM‑based methods struggle with partial activations and imprecise object boundaries due to optimization discrepancies between classification and segmentation. Recently, the conditional diffusion model (CDM) has been used as an alternative for generating segmentation masks in WSSS, leveraging its strong image generation capabilities tailored to specific class distributions. By modifying or perturbing the condition during diffusion sampling, the related objects can be highlighted in the generated images. Yet, the saliency maps generated by CDMs are prone to noise from background alterations during reverse diffusion. To alleviate the problem, we introduce Contrastive Learning with Diffusion Features (CLDF), a novel method that uses contrastive learning to train a pixel decoder to map the diffusion features from a frozen CDM to a low‑dimensional embedding space for segmentation. Specifically, we integrate gradient maps generated from CDM external classifier with CAMs to identify foreground and background pixels with fewer false positives/negatives for contrastive learning, enabling robust pixel embedding learning. Experimental results on four segmentation tasks from two public medical datasets demonstrate that our method significantly outperforms existing baselines.

Abstract:
Infrared imaging helps improve the perception capabilities of autonomous driving in complex weather conditions such as fog, rain, and low light. However, infrared images often suffer from low contrast, especially for non‑heat‑emitting targets like bicycles, which significantly affects the performance of downstream high‑level vision tasks. Furthermore, achieving contrast enhancement without amplifying noise and losing important information remains a challenge. To address these challenges, we propose a task‑oriented infrared image enhancement method. Our approach consists of two key components: layer decomposition and saliency information extraction. First, we design a layer decomposition method for infrared images, which enhances scene details while preserving dark region features, providing more features for subsequent saliency information extraction. Then, we propose a morphological reconstruction‑based saliency extraction method that effectively extracts and enhances target information without amplifying noise. Our method improves the image quality for object detection and semantic segmentation tasks. Extensive experiments demonstrate that our approach outperforms state‑of‑the‑art methods.

Abstract:
Bronchopulmonary dysplasia (BPD) is a common complication among preterm neonates, with portable X‑ray imaging serving as the standard diagnostic modality in neonatal intensive care units (NICUs). However, lung magnetic resonance imaging (MRI) offers a non‑invasive alternative that avoids sedation and radiation while providing detailed insights into the underlying mechanisms of BPD. Leveraging high‑resolution 3D MRI data, advanced image processing and semantic segmentation algorithms can be developed to assist clinicians in identifying the etiology of BPD. In this dataset, we present MRI scans paired with corresponding semantic segmentations of the lungs and trachea for 40 neonates, the majority of whom are diagnosed with BPD. The imaging data consist of free‑breathing 3D stack‑of‑stars radial gradient echo acquisitions, known as the StarVIBE series. Additionally, we provide comprehensive clinical data and baseline segmentation models, validated against clinical assessments, to support further research and development in neonatal lung imaging.

Abstract:
Interactive segmentation (IS) allows users to iteratively refine object boundaries with minimal cues, such as positive and negative clicks. While the Segment Anything Model (SAM) has garnered attention in the IS community for its promptable segmentation capabilities, it often struggles in specialized domains or when handling complex scenarios (e.g., camouflaged or multi‑part objects). To overcome these challenges, we propose DC‑TTA, a novel test‑time adaptation (TTA) framework that adapts SAM on a per‑sample basis by leveraging user interactions as supervision. Instead of forcing a single model to incorporate all user clicks at once, DC‑TTA partitions the clicks into more coherent subsets, each processed independently via TTA with a separated model. This Divide‑and‑Conquer strategy reduces conflicts among diverse cues and enables more localized updates. Finally, we merge the adapted models to form a unified predictor that integrates the specialized knowledge from each subset. Experimental results across various benchmarks demonstrate that DC‑TTA significantly outperforms SAM's zero‑shot results and conventional TTA methods, effectively handling complex tasks such as camouflaged object segmentation with fewer interactions and improved accuracy.

Abstract:
Surface defect detection plays a critical role in industrial quality inspection. Recent advances in artificial intelligence have significantly enhanced the automation level of detection processes. However, conventional semantic segmentation and object detection models heavily rely on large‑scale annotated datasets, which conflicts with the practical requirements of defect detection tasks. This paper proposes a novel weakly supervised semantic segmentation framework comprising two key components: a region‑aware class activation map (CAM) and pseudo‑label training. To address the limitations of existing CAM methods, especially low‑resolution thermal maps and insufficient detail preservation, we introduce filtering‑guided backpropagation (FGBP), which refines target regions by filtering gradient magnitudes to identify areas with higher relevance to defects. Building upon this, we further develop a region‑aware weighted module to enhance spatial precision. Finally, pseudo‑label segmentation is implemented to refine the model's performance iteratively. Comprehensive experiments on industrial defect datasets demonstrate the superiority of our method. The proposed framework effectively bridges the gap between weakly supervised learning and high‑precision defect segmentation, offering a practical solution for resource‑constrained industrial scenarios.

Abstract:
Text‑to‑image retrieval (TIR) aims to find relevant images based on a textual query, but existing approaches are primarily based on whole‑image captions and lack interpretability. Meanwhile, referring expression segmentation (RES) enables precise object localization based on natural language descriptions but is computationally expensive when applied across large image collections. To bridge this gap, we introduce Mask‑aware TIR (MaTIR), a new task that unifies TIR and RES, requiring both efficient image search and accurate object segmentation. To address this task, we propose a two‑stage framework, comprising a first stage for segmentation‑aware image retrieval and a second stage for reranking and object grounding with a multimodal large language model (MLLM). In the first stage, we leverage SAM 2 to generate object masks and Alpha‑CLIP to extract region‑level embeddings offline, enabling effective and scalable online retrieval. In the second stage, the MLLM is used to refine retrieval rankings and generate bounding boxes, which are matched to segmentation masks. We evaluate our approach on COCO and D^3 datasets, demonstrating significant improvements in both retrieval accuracy and segmentation quality over previous methods.

Abstract:
Recent open‑vocabulary 3D scene understanding approaches mainly focus on training 3D networks through contrastive learning with point‑text pairs or by distilling 2D features into 3D models via point‑pixel alignment. While these methods show considerable performance in benchmarks with limited vocabularies, they struggle to handle diverse object categories, as the limited amount of 3D data constrains the training of strong open‑vocabulary 3D models. We observe that 2D multi‑view fusion methods take precedence in understanding diverse concepts in 3D scenes. However, inherent noise in vision‑language models leads multi‑view fusion to sub‑optimal performance. To this end, we introduce MVOV3D, a novel approach aimed at unleashing the potential of 2D multi‑view fusion for open‑vocabulary 3D scene understanding. We focus on reducing the inherent noise without training, thereby preserving the generalizability while enhancing open‑world capabilities. Specifically, MVOV3D improves multi‑view 2D features by leveraging precise region‑level image features and text features encoded by CLIP encoders and incorporates 3D geometric priors to optimize multi‑view fusion. Extensive experiments on various datasets demonstrate the effectiveness of our method. Notably, our MVOV3D achieves a new record with 14.7% mIoU on ScanNet200 and 16.2% mIoU on Matterport160 for challenging open‑vocabulary semantic segmentation, outperforming current leading trained 3D networks by a significant margin.

Abstract:
Agricultural image semantic segmentation is a pivotal component of modern agriculture, facilitating accurate visual data analysis to improve crop management, optimize resource utilization, and boost overall productivity. This study proposes an efficient image segmentation method for precision agriculture, focusing on accurately delineating farmland anomalies to support informed decision‑making and proactive interventions. A novel Dual Atrous Separable Convolution (DAS Conv) module is integrated within the DeepLabV3‑based segmentation framework. The DAS Conv module is meticulously designed to achieve an optimal balance between dilation rates and padding size, thereby enhancing model performance without compromising efficiency. The study also incorporates a strategic skip connection from an optimal stage in the encoder to the decoder to bolster the model's capacity to capture fine‑grained spatial features. Despite its lower computational complexity, the proposed model outperforms its baseline and achieves performance comparable to highly complex transformer‑based state‑of‑the‑art (SOTA) models on the Agriculture Vision benchmark dataset. It achieves more than 66% improvement in efficiency when considering the trade‑off between model complexity and performance, compared to the SOTA model. This study highlights an efficient and effective solution for improving semantic segmentation in remote sensing applications, offering a computationally lightweight model capable of high‑quality performance in agricultural imagery.
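
The interplay between dilation rate and padding mentioned in this abstract follows from standard convolution size arithmetic. The Python sketch below shows only that general relation; it is not the DAS Conv module, and the helper names and the feature‑map size of 64 are illustrative assumptions.

```python
def conv_output_size(size, kernel=3, dilation=1, padding=0, stride=1):
    """Spatial output size of a (dilated) convolution along one axis."""
    # A dilated kernel covers dilation * (kernel - 1) + 1 input positions.
    effective_kernel = dilation * (kernel - 1) + 1
    return (size + 2 * padding - effective_kernel) // stride + 1


def same_padding(kernel, dilation):
    """Padding that preserves the size for an odd kernel at stride 1."""
    return dilation * (kernel - 1) // 2


# With a 3x3 kernel, padding equal to the dilation rate keeps the size fixed.
sizes = {d: conv_output_size(64, kernel=3, dilation=d,
                             padding=same_padding(3, d))
         for d in (1, 2, 4, 6)}
```

Without padding, the same 3x3 kernel at dilation 2 would shrink a 64‑pixel axis to 60, which is why the padding has to grow with the dilation rate when the feature map size must be preserved.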

Abstract:
Zero‑shot Semantic Segmentation (ZSS) aims to segment both seen and unseen classes using supervision from only seen classes. Beyond adaptation‑based methods, distillation‑based approaches transfer the vision‑language alignment of a vision‑language model, e.g., CLIP, to segmentation models. However, such knowledge transfer remains challenging due to: (1) the difficulty of aligning vision‑based features with the textual space, which requires combining spatial precision with vision‑language alignment; and (2) the semantic gap between CLIP's global representations and the local, fine‑grained features of segmentation models. To address challenge (1), we propose Chimera‑Seg, which integrates a segmentation backbone as the body and a CLIP‑based semantic head as the head, like the Chimera in Greek mythology, combining spatial precision with vision‑language alignment. Specifically, Chimera‑Seg comprises a trainable segmentation model and a CLIP Semantic Head (CSH), which maps dense features into the CLIP‑aligned space. The CSH incorporates a frozen subnetwork and fixed projection layers from the CLIP visual encoder, along with lightweight trainable components. The partial module from the CLIP visual encoder, paired with the segmentation model, retains segmentation capability while easing the mapping to CLIP's semantic space. To address challenge (2), we propose Selective Global Distillation (SGD), which distills knowledge from dense features exhibiting high similarity to the CLIP CLS token, while gradually reducing the number of features used for alignment as training progresses. In addition, we use a Semantic Alignment Module (SAM) to further align dense visual features with semantic embeddings extracted from the frozen CLIP text encoder. Experiments on two benchmarks show improvements of 0.9% and 1.2% in hIoU.
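
The selection step described for Selective Global Distillation can be sketched in Python as ranking dense features by similarity to the CLS token and keeping a shrinking top fraction. This is only an illustration under stated assumptions: cosine similarity, a linear decay schedule, and the `start` and `end` ratios are hypothetical choices, not the authors' settings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def keep_ratio(step, total_steps, start=1.0, end=0.25):
    """Assumed linear decay of the fraction of features kept for alignment."""
    t = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * t


def select_features(dense_feats, cls_token, ratio):
    """Indices of the dense features most similar to the CLS token."""
    k = max(1, math.ceil(ratio * len(dense_feats)))
    ranked = sorted(range(len(dense_feats)),
                    key=lambda i: cosine(dense_feats[i], cls_token),
                    reverse=True)
    return ranked[:k]


# Toy example with four 2-D "dense features" and a CLS token.
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
cls_token = [1.0, 0.0]
early = select_features(feats, cls_token, keep_ratio(0, 10))
late = select_features(feats, cls_token, keep_ratio(10, 10))
```

At the start of training all four toy features are kept, ordered by similarity; at the end only the single most CLS‑like feature remains.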

Abstract:
Reliable semantic segmentation of open environments is essential for intelligent systems, yet significant problems remain: 1) Existing RGB‑T semantic segmentation models mainly rely on low‑level visual features and lack high‑level textual information, so they struggle with accurate segmentation when categories share similar visual characteristics. 2) While SAM excels in instance‑level segmentation, integrating it with thermal images and text is hindered by modality heterogeneity and computational inefficiency. To address these, we propose TASeg, a text‑aware RGB‑T segmentation framework that uses Low‑Rank Adaptation (LoRA) fine‑tuning to adapt vision foundation models. Specifically, we propose a Dynamic Feature Fusion Module (DFFM) in the image encoder, which effectively merges features from multiple visual modalities while freezing SAM's original transformer blocks. Additionally, we incorporate CLIP‑generated text embeddings in the mask decoder to enable semantic alignment, which further rectifies classification errors and improves the accuracy of semantic understanding. Experimental results across diverse datasets demonstrate that our method achieves superior performance in challenging scenarios with fewer trainable parameters.

Abstract:
Land cover maps generated from semantic segmentation of high‑resolution remotely sensed images have drawn much attention in the photogrammetry and remote sensing research community. Massive volumes of fine‑resolution remotely sensed (FRRS) images, acquired with improving sensing and imaging technologies, are now available. However, accurate semantic segmentation of such FRRS images is greatly affected by substantial class disparities, the invisibility of key ground objects due to occlusion, and object size variation. Despite the extraordinary potential of deep convolutional neural networks (DCNNs) in image feature learning and representation, extracting sufficient features from FRRS images for accurate semantic segmentation is still challenging. These challenges demand that deep learning models learn robust features and generate sufficient feature descriptors. Specifically, learning multi‑contextual features to guarantee adequate coverage of varied object sizes from the ground scene and harnessing global‑local contexts to overcome class disparities challenge even very deep networks. Deeper networks lose significant spatial detail through gradual downsampling, resulting in poor segmentation results and coarse boundaries. This article presents a stacked deep residual network (SDRNet) for semantic segmentation from FRRS images. The proposed framework utilizes two stacked encoder‑decoder networks to harness long‑range semantics yet preserve spatial information, and dilated residual blocks (DRB) between each encoder and decoder network to capture sufficient global dependencies, thus improving segmentation performance. Our experimental results obtained using the ISPRS Vaihingen and Potsdam datasets demonstrate that the SDRNet performs effectively and competitively against current DCNNs in semantic segmentation.

Abstract:
Deep neural networks have set the state‑of‑the‑art in computer vision tasks such as bounding box detection and semantic segmentation. Object detectors and segmentation models assign confidence scores to predictions, reflecting the model's uncertainty in object detection or pixel‑wise classification. However, these confidence estimates are often miscalibrated, as their architectures and loss functions are tailored to task performance rather than probabilistic foundation. Even with well calibrated predictions, object detectors fail to quantify uncertainty outside detected bounding boxes, i.e., the model does not make a probability assessment of whether an area without detected objects is truly free of obstacles. This poses a safety risk in applications such as automated driving, where uncertainty in empty areas remains unexplored. In this work, we propose an object detection model grounded in spatial statistics. Bounding box data matches realizations of a marked point process, commonly used to describe the probabilistic occurrence of spatial point events identified as bounding box centers, where marks are used to describe the spatial extension of bounding boxes and classes. Our statistical framework enables a likelihood‑based training and provides well‑defined confidence estimates for whether a region is drivable, i.e., free of objects. We demonstrate the effectiveness of our method through calibration assessments and evaluation of performance.
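
As a worked illustration of what a confidence estimate for an object-free region can look like, consider the simplest special case. This is an assumption made here, not the model from the abstract: bounding box centers are taken to follow an inhomogeneous Poisson point process with intensity $\lambda(s)$, and $N(A)$ denotes the number of centers falling in a region $A$. The probability that $A$ contains no object center is then the void probability:

```latex
\Lambda(A) = \int_A \lambda(s)\,\mathrm{d}s, \qquad
P\bigl(N(A) = 0\bigr) = \exp\bigl(-\Lambda(A)\bigr)
```

The symbols $\lambda$, $\Lambda$, and $N$ are introduced only for this sketch. The abstract does not state which family of marked point processes is used, and the spatial extent of boxes (the marks) is ignored here.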

Abstract:
This technical report presents submission systems for Task 4 of the DCASE 2025 Challenge. First, the model incorporates additional audio features (spectral roll‑off and chroma features) into the embedding feature extracted from the mel‑spectral feature to improve the classification capabilities of an audio‑tagging model in the spatial semantic segmentation of sound scenes (S5) system. This approach is motivated by the fact that mixed audio often contains subtle cues that are difficult to capture with mel‑spectrograms alone. Thus, these additional features offer alternative perspectives for the model. Second, an agent‑based label correction system is applied to the outputs processed by the S5 system. This system reduces false positives, improving the final class‑aware signal‑to‑distortion ratio improvement (CA‑SDRi) metric. Finally, we refine the training dataset to enhance the classification accuracy of low‑performing classes by removing irrelevant samples and incorporating external data. That is, audio mixtures are generated from a limited number of data points; thus, even a small number of out‑of‑class data points could degrade model performance. The experiments demonstrate that the submitted systems employing these approaches improve CA‑SDRi by up to 14.7% relative to the baseline of DCASE 2025 Challenge Task 4.

Abstract:
The rapid advancement of vision‑language models (VLMs) in 3D domains has accelerated research in text‑query‑guided point cloud processing, though existing methods underperform in point‑level segmentation due to inadequate 3D‑text alignment that limits local feature‑text context linking. To address this limitation, we propose MR‑COSMO, a Visual‑Text Memory Recall and Direct CrOSs‑MOdal Alignment Method for Query‑Driven 3D Segmentation, establishing explicit alignment between 3D point clouds and text/2D image data through a dedicated direct cross‑modal alignment module while implementing a visual‑text memory module with specialized feature banks. This direct alignment mechanism enables precise fusion of geometric and semantic features, while the memory module employs specialized banks storing text features, visual features, and their correspondence mappings to dynamically enhance scene‑specific representations via attention‑based knowledge recall. Comprehensive experiments across 3D instruction, reference, and semantic segmentation benchmarks confirm state‑of‑the‑art performance.

Abstract:
Pathology image segmentation is crucial in computational pathology for analyzing histological features relevant to cancer diagnosis and prognosis. However, current methods face major challenges in clinical applications due to limited annotated data and restricted category definitions. To address these limitations, we propose PathSegmentor, the first text‑prompted segmentation foundation model designed specifically for pathology images. We also introduce PathSeg, the largest and most comprehensive dataset for pathology segmentation, built from 21 public sources and containing 275k image‑mask‑label triples across 160 diverse categories. With PathSegmentor, users can perform semantic segmentation using natural language prompts, eliminating the need for laborious spatial inputs such as points or boxes. Extensive experiments demonstrate that PathSegmentor outperforms specialized models with higher accuracy and broader applicability, while maintaining a compact architecture. It significantly surpasses existing spatial‑ and text‑prompted models by 0.145 and 0.429 in overall Dice scores, respectively, showing strong robustness in segmenting complex structures and generalizing to external datasets. Moreover, PathSegmentor's outputs enhance the interpretability of diagnostic models through feature importance estimation and imaging biomarker discovery, offering pathologists evidence‑based support for clinical decision‑making. This work advances the development of explainable AI in precision oncology.

Abstract:
Artificial intelligence, including deep learning models, will play a transformative role in automated medical image analysis for the diagnosis of cardiac disorders and their management. Automated, accurate delineation of cardiac images is a necessary first step for the quantification and automated diagnosis of cardiac disorders. In this paper, we propose a deep learning based enhanced UNet model, U‑R‑Veda, which integrates convolution transformations, vision transformer, residual links, channel‑attention, and spatial attention, together with edge‑detection based skip‑connections for an accurate fully‑automated semantic segmentation of cardiac magnetic resonance (CMR) images. The model extracts local features and their interrelationships using a stack of combination convolution blocks, with embedded channel and spatial attention in the convolution block, and vision transformers. Deep embedding of channel and spatial attention in the convolution block identifies important features and their spatial localization. Combining edge information with channel and spatial attention in the skip connections reduces information loss during convolution transformations. The overall model significantly improves the semantic segmentation of CMR images necessary for improved medical image analysis. An algorithm for the dual attention module (channel and spatial attention) is presented. Performance results show that U‑R‑Veda achieves an average accuracy of 95.2%, based on DSC metrics. The model outperforms the accuracy attained by other models, based on DSC and HD metrics, especially for the delineation of right‑ventricle and left‑ventricle‑myocardium.

Abstract:
We introduce CogGen, a learner‑centered AI architecture that transforms programming videos into interactive, adaptive learning experiences by integrating student modeling with generative AI tutoring based on the Cognitive Apprenticeship framework. The architecture consists of three components: (1) video segmentation by learning goals, (2) a conversational tutoring engine applying Cognitive Apprenticeship strategies, and (3) a student model using Bayesian Knowledge Tracing to adapt instruction. Our technical evaluation demonstrates effective video segmentation accuracy and strong pedagogical alignment across knowledge, method, action, and interaction layers. Ablation studies confirm the necessity of each component in generating effective guidance. This work advances AI‑powered tutoring by bridging structured student modeling with interactive AI conversations, offering a scalable approach to enhancing video‑based programming education.
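
The student-model component can be illustrated with the standard Bayesian Knowledge Tracing update. This is a minimal sketch of textbook BKT, not CogGen's code; the function name `bkt_update` and the default slip, guess, and learn probabilities are hypothetical values chosen for the example.

```python
def bkt_update(p_known, correct, p_slip=0.1, p_guess=0.2, p_learn=0.15):
    """One Bayesian Knowledge Tracing step for a single skill.

    p_known: prior probability that the student has mastered the skill.
    correct: whether the observed answer was correct.
    Parameter defaults are illustrative, not taken from the source.
    """
    if correct:
        # Evidence: mastered and did not slip, or not mastered and guessed.
        numerator = p_known * (1 - p_slip)
        denominator = numerator + (1 - p_known) * p_guess
    else:
        # Evidence: mastered but slipped, or not mastered and did not guess.
        numerator = p_known * p_slip
        denominator = numerator + (1 - p_known) * (1 - p_guess)
    posterior = numerator / denominator
    # Account for the chance of learning the skill after this opportunity.
    return posterior + (1 - posterior) * p_learn


if __name__ == "__main__":
    p = 0.3
    for answer in [True, True, False, True]:
        p = bkt_update(p, answer)
        print(round(p, 3))
```

A tutoring engine could, for example, give more scaffolding while the estimate stays low and fade support as it rises; the adaptation rule CogGen actually uses is not described in the abstract.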

Abstract:
Rock bolts are crucial components of the subterranean support systems in underground mines that provide adequate structural reinforcement to the rock mass to prevent unforeseen hazards like rockfalls. This makes frequent assessments of such bolts critical for maintaining rock mass stability and minimising risks in underground mining operations. Where manual surveying of rock bolts is challenging due to the low light conditions in the underground mines and the time‑intensive nature of the process, automated detection of rock bolts serves as a plausible solution. To that end, this study focuses on the automatic identification of rock bolts within medium to large‑scale 3D point clouds obtained from underground mines using mobile laser scanners. Existing techniques for automated rock bolt identification primarily rely on feature engineering and traditional machine learning approaches. However, such techniques lack robustness as these point clouds present several challenges due to data noise, varying environments, and complex surrounding structures. Moreover, the target rock bolts are extremely small objects within large‑scale point clouds and are often partially obscured due to the application of reinforcement shotcrete. Addressing these challenges, this paper proposes an approach termed DeepBolt, which employs a novel two‑stage deep learning architecture specifically designed for handling severe class imbalance for the automatic and efficient identification of rock bolts in complex 3D point clouds. The proposed method surpasses state‑of‑the‑art semantic segmentation models by up to 42.5% in Intersection over Union (IoU) for rock bolt points. Additionally, it outperforms existing rock bolt identification techniques, achieving a 96.41% precision and 96.96% recall in classifying rock bolts, demonstrating its robustness and effectiveness in complex underground environments.
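
For readers unfamiliar with the reported metrics, the sketch below computes IoU, precision, and recall for one class from per-point binary labels. The function name `binary_point_metrics` and the toy labels are illustrative assumptions; the abstract does not say whether precision and recall are computed per point or per detected bolt.

```python
def binary_point_metrics(pred, target):
    """Compute IoU, precision and recall for one class from 0/1 labels."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    tp = sum(1 for p, t in zip(pred, target) if p and t)
    fp = sum(1 for p, t in zip(pred, target) if p and not t)
    fn = sum(1 for p, t in zip(pred, target) if not p and t)
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"iou": iou, "precision": precision, "recall": recall}


if __name__ == "__main__":
    predicted = [1, 1, 0, 0, 1]   # 1 = point labelled as rock bolt
    reference = [1, 0, 0, 1, 1]
    print(binary_point_metrics(predicted, reference))
```

Because bolt points are a tiny fraction of a mine-scale point cloud, IoU on the bolt class is far more informative than overall point accuracy, which a model could maximize by predicting background everywhere.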

Abstract:
Foundation models are rapidly transforming Earth Observation data mining by enabling generalizable and scalable solutions for key tasks such as scene classification and semantic segmentation. While most efforts in the geospatial domain have focused on developing large models trained from scratch using massive Earth Observation datasets, an alternative strategy that remains underexplored is the reuse and combination of existing pretrained models. In this study, we investigate whether foundation models pretrained on remote sensing and general vision datasets can be effectively combined to improve performance across a diverse set of key Earth Observation tasks. Using the GEO‑Bench benchmark, we evaluate several prominent models, including Prithvi, Hiera, and DOFA, on eleven datasets covering a range of spatial resolutions, sensor modalities, and task types. The results show that feature‑level ensembling of smaller pretrained models can match or exceed the performance of much larger models, while requiring less training time and computational resources. Moreover, the study highlights the potential of applying knowledge distillation to transfer the strengths of ensembles into more compact models, offering a practical path for deploying foundation models in real‑world Earth Observation applications.

Abstract:
Multi‑sensor fusion perception (MSFP) is a key technology for embodied AI, which can serve a variety of downstream tasks (e.g., 3D object detection and semantic segmentation) and application scenarios (e.g., autonomous driving and swarm robotics). Recently, impressive achievements in AI‑based MSFP methods have been reviewed in relevant surveys. However, after a rigorous and detailed investigation, we observe that the existing surveys have some limitations. For one thing, most surveys are oriented to a single task or research field, such as 3D object detection or autonomous driving. Therefore, researchers in other related tasks often find it difficult to benefit directly. For another, most surveys only introduce MSFP from a single perspective of multi‑modal fusion, while lacking consideration of the diversity of MSFP methods, such as multi‑view fusion and time‑series fusion. To this end, in this paper, we aim to organize MSFP research from a task‑agnostic perspective, where methods are reported from various technical views. Specifically, we first introduce the background of MSFP. Next, we review multi‑modal and multi‑agent fusion methods. Going a step further, we analyze time‑series fusion methods. In the era of LLMs, we also investigate multimodal LLM fusion methods. Finally, we discuss open challenges and future directions for MSFP. We hope this survey can help researchers understand the important progress in MSFP and provide possible insights for future research.

Abstract:
Inspired by the biological visual system that selectively allocates attention to efficiently identify salient objects or regions, underwater salient instance segmentation (USIS) aims to jointly address the problems of where to look (saliency prediction) and what is there (instance segmentation) in underwater scenarios. However, USIS remains an underexplored challenge due to the inaccessibility and dynamic nature of underwater environments, as well as the scarcity of large‑scale, high‑quality annotated datasets. In this paper, we introduce USIS16K, a large‑scale dataset comprising 16,151 high‑resolution underwater images collected from diverse environmental settings and covering 158 categories of underwater objects. Each image is annotated with high‑quality instance‑level salient object masks, representing a significant advance in terms of diversity, complexity, and scalability. Furthermore, we provide benchmark evaluations on underwater object detection and USIS tasks using USIS16K. To facilitate future research in this domain, the dataset and benchmark models are publicly available.

Abstract:
With the rapid development of ultra‑high resolution (UHR) remote sensing technology, the demand for accurate and efficient semantic segmentation has increased significantly. However, existing methods face challenges in computational efficiency and multi‑scale feature fusion. To address these issues, we propose GLCANet (Global‑Local Cross‑Attention Network), a lightweight segmentation framework designed for UHR remote sensing imagery. GLCANet employs a dual‑stream architecture to efficiently fuse global semantics and local details while minimizing GPU usage. A self‑attention mechanism enhances long‑range dependencies, refines global features, and preserves local details for better semantic consistency. A masked cross‑attention mechanism also adaptively fuses global‑local features, selectively enhancing fine‑grained details while exploiting global context to improve segmentation accuracy. Experimental results show that GLCANet outperforms state‑of‑the‑art methods regarding accuracy and computational efficiency. The model effectively processes large, high‑resolution images with a small memory footprint, providing a promising solution for real‑world remote sensing applications.

Abstract:
Open‑Vocabulary Camouflaged Object Segmentation (OVCOS) seeks to segment and classify camouflaged objects from arbitrary categories, presenting unique challenges due to visual ambiguity and unseen categories. Recent approaches typically adopt a two‑stage paradigm: first segmenting objects, then classifying the segmented regions using Vision Language Models (VLMs). However, these methods (1) suffer from a domain gap caused by the mismatch between VLMs' full‑image training and cropped‑region inference, and (2) depend on generic segmentation models optimized for well‑delineated objects, making them less effective for camouflaged objects. Without explicit guidance, generic segmentation models often overlook subtle boundaries, leading to imprecise segmentation. In this paper, we introduce a novel VLM‑guided cascaded framework to address these issues in OVCOS. For segmentation, we leverage the Segment Anything Model (SAM), guided by the VLM. Our framework uses VLM‑derived features as explicit prompts to SAM, effectively directing attention to camouflaged regions and significantly improving localization accuracy. For classification, we avoid the domain gap introduced by hard cropping. Instead, we treat the segmentation output as a soft spatial prior via the alpha channel, which retains the full image context while providing precise spatial guidance, leading to more accurate and context‑aware classification of camouflaged objects. The same VLM is shared across both segmentation and classification to ensure efficiency and semantic consistency. Extensive experiments on both OVCOS and conventional camouflaged object segmentation benchmarks demonstrate the clear superiority of our method, highlighting the effectiveness of leveraging rich VLM semantics for both segmentation and classification of camouflaged objects.
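
The difference between hard cropping and a soft spatial prior carried in an alpha channel can be shown on a toy image. This sketch only builds the two inputs and makes no claim about how the VLM in the paper consumes the alpha channel; `attach_soft_mask`, `hard_crop`, and the `threshold` parameter are hypothetical names.

```python
def attach_soft_mask(rgb_image, mask_probs):
    """Append the mask probability as a fourth (alpha) value per pixel.

    The RGB values of every pixel are kept, so the full scene context
    survives; the mask only adds spatial guidance.
    """
    if len(rgb_image) != len(mask_probs):
        raise ValueError("image and mask must have the same height")
    rgba = []
    for row, mask_row in zip(rgb_image, mask_probs):
        if len(row) != len(mask_row):
            raise ValueError("image and mask must have the same width")
        rgba.append([(r, g, b, float(a)) for (r, g, b), a in zip(row, mask_row)])
    return rgba


def hard_crop(rgb_image, mask_probs, threshold=0.5):
    """Baseline for contrast: zero out every pixel below the threshold."""
    return [
        [pixel if a >= threshold else (0, 0, 0) for pixel, a in zip(row, mask_row)]
        for row, mask_row in zip(rgb_image, mask_probs)
    ]


if __name__ == "__main__":
    image = [[(200, 180, 90), (10, 20, 30)], [(190, 170, 80), (15, 25, 35)]]
    mask = [[0.9, 0.2], [0.6, 0.0]]
    print(attach_soft_mask(image, mask)[0][1])  # background pixel keeps its colour
    print(hard_crop(image, mask)[0][1])         # background pixel is discarded
```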

Abstract:
We present AnchorDP3, a diffusion policy framework for dual‑arm robotic manipulation that achieves state‑of‑the‑art performance in highly randomized environments. AnchorDP3 integrates three key innovations: (1) Simulator‑Supervised Semantic Segmentation, using rendered ground truth to explicitly segment task‑critical objects within the point cloud, which provides strong affordance priors; (2) Task‑Conditioned Feature Encoders, lightweight modules processing augmented point clouds per task, enabling efficient multi‑task learning through a shared diffusion‑based action expert; (3) Affordance‑Anchored Keypose Diffusion with Full State Supervision, replacing dense trajectory prediction with sparse, geometrically meaningful action anchors, i.e., keyposes such as pre‑grasp pose, grasp pose directly anchored to affordances, drastically simplifying the prediction space; the action expert is forced to predict both robot joint angles and end‑effector poses simultaneously, which exploits geometric consistency to accelerate convergence and boost accuracy. Trained on large‑scale, procedurally generated simulation data, AnchorDP3 achieves a 98.7% average success rate in the RoboTwin benchmark across diverse tasks under extreme randomization of objects, clutter, table height, lighting, and backgrounds. This framework, when integrated with the RoboTwin real‑to‑sim pipeline, has the potential to enable fully autonomous generation of deployable visuomotor policies from only scene and instruction, totally eliminating human demonstrations from learning manipulation skills.

Abstract:
Continual Test Time Adaptation (CTTA) is a task that requires a source pre‑trained model to continually adapt to new scenarios with changing target distributions. Existing CTTA methods primarily focus on mitigating the challenges of catastrophic forgetting and error accumulation. Though there have been emerging methods based on forgetting adaptation with parameter‑efficient fine‑tuning, they still struggle to balance competitive performance and efficient model adaptation, particularly in complex tasks like semantic segmentation. In this paper, to tackle the above issues, we propose a novel pipeline, Orthogonal Projection Subspace to aggregate online Prior‑knowledge, dubbed OoPk. Specifically, we first project a tuning subspace orthogonally which allows the model to adapt to new domains while preserving the knowledge integrity of the pre‑trained source model to alleviate catastrophic forgetting. Then, we elaborate an online prior‑knowledge aggregation strategy that employs an aggressive yet efficient image masking strategy to mimic potential target dynamism, enhancing the student model's domain adaptability. This further gradually ameliorates the teacher model's knowledge, ensuring high‑quality pseudo labels and reducing error accumulation. We demonstrate our method with extensive experiments that surpass previous CTTA methods and achieve competitive performances across various continual TTA benchmarks in semantic segmentation tasks.

Abstract:
Recent advances in autonomous driving (AD) have highlighted the potential of hyperspectral imaging (HSI) for enhanced environmental perception, particularly in challenging weather and lighting conditions. However, efficiently processing high‑dimensional spectral data remains a significant challenge. This paper presents an empirical investigation of a Multi‑Scale Attention Mechanism (MSAM) for enhanced spectral feature extraction through three parallel 1D convolutions with varying kernel sizes (1‑11) and adaptive feature aggregation. By integrating MSAM into UNet's skip connections, we evaluate performance improvements in semantic segmentation across multiple HSI datasets for urban driving scenarios. Comprehensive ablation studies demonstrate that MSAM consistently outperforms baseline UNet‑SC, achieving average improvements of 2.32% in mIoU and 2.88% in mF1, while maintaining competitive GPU performance against established attention mechanisms. Our findings reveal that optimal kernel combinations are dataset‑specific, with configurations such as (1;5;11) and (3;7;11) demonstrating particularly strong performance. This empirical investigation advances understanding of HSI processing capabilities for AD applications and establishes a foundation for adaptive multi‑scale spectral feature extraction in automotive deployment.
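
The idea of three parallel 1D convolutions over the spectral axis followed by adaptive aggregation can be sketched for a single pixel spectrum. This is an illustration under assumptions, not the MSAM implementation: the box-filter kernels stand in for learned weights, and the softmax over per-branch logits is one plausible form of adaptive aggregation that the abstract does not specify. All function names are hypothetical.

```python
import math


def conv1d_same(signal, kernel):
    """Zero-padded 1D correlation that keeps the input length (odd kernel sizes)."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, weight in enumerate(kernel):
            idx = i + j - half
            if 0 <= idx < len(signal):
                acc += weight * signal[idx]
        out.append(acc)
    return out


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def multi_scale_spectral(spectrum, kernels, branch_logits):
    """Run one branch per kernel, then mix the branches with softmax weights."""
    branches = [conv1d_same(spectrum, k) for k in kernels]
    weights = softmax(branch_logits)
    return [
        sum(w * branch[i] for w, branch in zip(weights, branches))
        for i in range(len(spectrum))
    ]


if __name__ == "__main__":
    spectrum = [float(i % 4) for i in range(12)]            # toy 12-band spectrum
    kernels = [[1.0], [1 / 5] * 5, [1 / 11] * 11]           # sizes 1, 5, 11
    print(multi_scale_spectral(spectrum, kernels, [0.0, 0.0, 0.0]))
```

The kernel sizes (1, 5, 11) mirror one of the configurations named in the abstract; in a trained model both the kernel weights and the branch logits would be learned.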

Abstract:
We introduce a novel end‑to‑end framework for jet reconstruction in high‑energy collider events, leveraging the efficiency and long‑range modeling capabilities of the Mamba architecture. Our model unifies instance segmentation, classification, and kinematic regression into a single multi‑task learning system, enabling a sophisticated multi‑level reconstruction that simultaneously identifies primary heavy jets (t, H, W/Z) and their constituent sub‑jets. To facilitate supervised learning for this complex task, we develop a novel method for assigning final‑state hadrons to their ancestor colored partons using a Mixed‑Integer Linear Programming solver, which generates high‑fidelity ground‑truth labels. The model achieves high classification accuracy, with an Average Precision score of 0.569 for W/Z‑jets and 0.568 for b‑jets, and shows exceptional precision in kinematic reconstruction. Furthermore, we show that the model not only maintains stable performance in high‑pileup environments but also successfully reconstructs the mass peaks of beyond the standard model particles. This work presents a powerful and versatile new tool for comprehensive event reconstruction at the LHC.

Abstract:
Semantic segmentation is commonly used for Oil Spill Detection (OSD) in remote sensing images. However, the limited availability of labelled oil spill samples and class imbalance present significant challenges that can reduce detection accuracy. Furthermore, most existing methods, which rely on convolutional neural networks (CNNs), struggle to detect small oil spill areas due to their limited receptive fields and inability to effectively capture global contextual information. This study explores the potential of State‑Space Models (SSMs), particularly Mamba, to overcome these limitations, building on their recent success in vision applications. We propose OSDMamba, the first Mamba‑based architecture specifically designed for oil spill detection. OSDMamba leverages Mamba's selective scanning mechanism to effectively expand the model's receptive field while preserving critical details. Moreover, we designed an asymmetric decoder incorporating ConvSSM and deep supervision to strengthen multi‑scale feature fusion, thereby enhancing the model's sensitivity to minority class samples. Experimental results show that the proposed OSDMamba achieves state‑of‑the‑art performance, yielding improvements of 8.9% and 11.8% in OSD across two publicly available datasets.

Abstract:
In‑context learning (ICL) enables generalization to new tasks with minimal labeled data. However, mainstream ICL approaches rely on a gridding strategy, which lacks the flexibility required for vision applications. We introduce Temporal, a time‑contrastive self‑supervised objective that pretrains a prompt retriever for visual ICL, and formulate ICL as a video object segmentation (VOS) task. Temporal addresses key limitations of grid‑based methods that restrict the number and resolution of context images. By reframing ICL as a VOS problem, our approach supports a variable number of context images while preserving their full resolution. To address the challenge of selecting optimal context sets for queries, we pretrain a prompt retriever on videos via self‑supervised learning, where adjacent frames serve as positives and distant frames as negatives. For image segmentation, the prompt retriever selects relevant sequences that, when combined with the query, form coherent videos for VOS processing. For video segmentation, it identifies keyframes, predicts their masks using our ICL pipeline, and propagates them throughout the sequence. When evaluated on MICCAI FLARE 2022, our method achieves substantial improvements over baselines: 90.95% Dice score for image segmentation (10.64% improvement) and 92.45% Dice for video segmentation (14.88% improvement).
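
The time-contrastive objective can be illustrated with an InfoNCE-style loss in which an adjacent frame is the positive and distant frames are negatives. InfoNCE is an assumption here, as the abstract does not name the exact loss; the embeddings are assumed to be L2-normalized so that a dot product acts as the similarity, and `time_contrastive_loss` and `temperature` are hypothetical names.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def time_contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: low when the anchor is closest to its adjacent frame.

    anchor:    embedding of a video frame (assumed L2-normalized).
    positive:  embedding of a temporally adjacent frame.
    negatives: embeddings of temporally distant frames.
    """
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, neg) / temperature for neg in negatives]
    peak = max(logits)
    # Stable log-sum-exp over the positive and all negatives.
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]


if __name__ == "__main__":
    frame_t = [1.0, 0.0]
    frame_t_plus_1 = [0.8, 0.6]               # adjacent frame, similar embedding
    far_frames = [[0.0, 1.0], [-1.0, 0.0]]    # distant frames
    print(time_contrastive_loss(frame_t, frame_t_plus_1, far_frames))
```

A retriever trained this way places temporally coherent frames close together, which is what lets retrieved context images and the query form a plausible pseudo-video for the VOS model.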

Abstract:
Recent Large Vision Language Models (LVLMs) demonstrate promising capabilities in unifying visual understanding and generative modeling, enabling both accurate content understanding and flexible editing. However, current approaches treat "what to see" and "how to edit" separately: they either perform isolated object segmentation or utilize segmentation masks merely as conditional prompts for local edit generation tasks, often relying on multiple disjointed models. To bridge these gaps, we introduce FOCUS, a unified LVLM that integrates segmentation‑aware perception and controllable object‑centric generation within an end‑to‑end framework. FOCUS employs a dual‑branch visual encoder to simultaneously capture global semantic context and fine‑grained spatial details. In addition, we leverage a MoVQGAN‑based visual tokenizer to produce discrete visual tokens that enhance generation quality. To enable accurate and controllable image editing, we propose a progressive multi‑stage training pipeline, where segmentation masks are jointly optimized and used as spatial condition prompts to guide the diffusion decoder. This strategy aligns visual encoding, segmentation, and generation modules, effectively bridging segmentation‑aware perception with fine‑grained visual synthesis. Extensive experiments across three core tasks, including multimodal understanding, referring segmentation accuracy, and controllable image generation, demonstrate that FOCUS achieves strong performance by jointly optimizing visual perception and generative capabilities.

Abstract:
Instance segmentation is essential for applications such as automated monitoring of plant health, growth, and yield. However, developing instance segmentation models requires extensive effort to create large‑scale datasets with pixel‑level annotations of each object instance, which restricts the use of deep learning in these areas. This challenge is more significant in images with densely packed, self‑occluded objects, which are common in agriculture. To address this challenge, we propose a semi‑self‑supervised learning approach that requires minimal manual annotation to develop a high‑performing instance segmentation model. We design GLMask, an image‑mask representation for the model to focus on shape, texture, and pattern while minimizing its dependence on color features. We develop a pipeline to generate semantic segmentation and then transform it into instance‑level segmentation. The proposed approach substantially outperforms the conventional instance segmentation models, establishing a state‑of‑the‑art wheat head instance segmentation model with mAP@50 of 98.5%. Additionally, we assessed the proposed methodology on the general‑purpose Microsoft COCO dataset, achieving a significant performance improvement of over 12.6% mAP@50. This highlights that the utility of our proposed approach extends beyond precision agriculture and applies to other domains, specifically those with similar data characteristics.

Abstract:
Referring Expression Segmentation (RES) enables precise object segmentation in images based on natural language descriptions, offering high flexibility and broad applicability in real‑world vision tasks. Despite the impressive performance of RES models, their robustness against adversarial examples remains largely unexplored. While prior adversarial attack methods have explored adversarial robustness on conventional segmentation models, they perform poorly when directly applied to RES models, failing to expose vulnerabilities in their multimodal structure. In practical open‑world scenarios, users typically issue multiple, diverse referring expressions to interact with the same image, highlighting the need for adversarial examples that generalize across varied textual inputs. Furthermore, from the perspective of privacy protection, ensuring that RES models do not segment sensitive content without explicit authorization is a crucial aspect of enhancing the robustness and security of multimodal vision‑language systems. To address these challenges, we present PEAT, an Embedding‑Guided Bidirectional Attack for RES models. Extensive experiments across multiple RES architectures and standard benchmarks show that PEAT consistently outperforms competitive baselines.

Abstract:
Unsupervised domain adaptation (UDA) methods effectively bridge domain gaps but struggle when the source and target domains belong to entirely distinct modalities. To address this limitation, we propose a novel setting called Heterogeneous‑Modal Unsupervised Domain Adaptation (HMUDA), which enables knowledge transfer between completely different modalities by leveraging a bridge domain containing unlabeled samples from both modalities. To learn under the HMUDA setting, we propose Latent Space Bridging (LSB), a specialized framework designed for the semantic segmentation task. Specifically, LSB utilizes a dual‑branch architecture, incorporating a feature consistency loss to align representations across modalities and a domain alignment loss to reduce discrepancies between class centroids across domains. Extensive experiments conducted on six benchmark datasets demonstrate that LSB achieves state‑of‑the‑art performance.
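
A domain alignment loss over class centroids can be sketched as follows. This is a generic illustration, not the LSB loss: the squared Euclidean distance, the averaging over shared classes, and the function names are assumptions, and in an unsupervised setting the labels on the unlabeled side would have to be pseudo-labels.

```python
def class_centroids(features, labels):
    """Mean feature vector per class label."""
    sums, counts = {}, {}
    for feature, label in zip(features, labels):
        acc = sums.setdefault(label, [0.0] * len(feature))
        for i, value in enumerate(feature):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def centroid_alignment_loss(centroids_a, centroids_b):
    """Mean squared distance between centroids of classes present in both domains."""
    shared = sorted(set(centroids_a) & set(centroids_b))
    if not shared:
        return 0.0
    total = 0.0
    for label in shared:
        total += sum((x - y) ** 2 for x, y in zip(centroids_a[label], centroids_b[label]))
    return total / len(shared)


if __name__ == "__main__":
    domain_a = class_centroids([[0.0, 0.0], [2.0, 2.0], [1.0, 1.0]], [0, 0, 1])
    domain_b = class_centroids([[1.0, 1.0], [3.0, 1.0]], [0, 1])  # pseudo-labels
    print(centroid_alignment_loss(domain_a, domain_b))
```

Minimizing such a term pulls same-class features from the two domains toward a common location in latent space, which is the intuition behind reducing centroid discrepancies.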

Abstract:
We introduce a new task of open‑world object counting in videos: given a text description, or an image example, that specifies the target object, the objective is to enumerate all the unique instances of the target objects in the video. This task is especially challenging in crowded scenes with occlusions and objects of similar appearance, where avoiding double counting and identifying reappearances is crucial. To this end, we make the following contributions: we introduce a model, CountVid, for this task. It leverages an image‑based counting model, and a promptable video segmentation and tracking model, to enable automated open‑world object counting across video frames. To evaluate its performance, we introduce VideoCount, a new dataset for this novel task built from the TAO and MOT20 tracking datasets, as well as from videos of penguins and metal alloy crystallization captured by x‑rays. Using this dataset, we demonstrate that CountVid provides accurate object counts, and significantly outperforms strong baselines. The VideoCount dataset, the CountVid model, and all the code are available at https://www.robots.ox.ac.uk/~vgg/research/countvid/.

Abstract:
Foundation Models (FMs) have achieved state‑of‑the‑art performance across domains by leveraging large‑scale pretraining. In Earth Observation (EO), the availability of petabyte‑scale satellite archives has recently enabled the development of GeoSpatial Foundation Models (GFMs). Yet, fundamental questions remain regarding how dataset size, model architecture, and size interact to determine downstream performance. In this work, we systematically explore this design space by pretraining and fine‑tuning models on three dataset scales: PhilEO Globe (0.5TB), FastTOM (2TB, introduced here), and MajorTOM (23TB). We evaluate three architectural families: Geo‑Aware U‑Net (CNN), ViT‑UPerNet (Transformer), and Mamba (State‑Space Model); across model sizes ranging from 44M to 300M parameters. All models are benchmarked on the PhilEO Bench, covering: road density and building density regression, and land cover segmentation, and are compared against existing GFMs such as TerraMind and Prithvi‑EO‑2.0. Our results show that CNN‑based models remain highly competitive in low‑shot settings, with a 200M‑parameter Geo‑Aware U‑Net outperforming larger architectures on regression tasks. However, when scaling to multi‑terabyte datasets, ViT‑UPerNet achieves the best performance, particularly for semantic segmentation on MajorTOM (23TB). Finally, we provide the first extensive evaluation of Mamba models in EO, highlighting their potential efficiency advantages, though further large‑scale pretraining is required to fully match CNNs and ViTs. All code, pretrained models, and the FastTOM dataset are released publicly, enabling reproducibility and further exploration of scaling laws for GFMs.

Abstract:
This paper presents VisLanding, a monocular 3D perception‑based framework for safe UAV (Unmanned Aerial Vehicle) landing. Addressing the core challenge of autonomous UAV landing in complex and unknown environments, this study innovatively leverages the depth‑normal synergy prediction capabilities of the Metric3D V2 model to construct an end‑to‑end safe landing zones (SLZ) estimation framework. By introducing a safe zone segmentation branch, we transform the landing zone estimation task into a binary semantic segmentation problem. The model is fine‑tuned and annotated using the WildUAV dataset from a UAV perspective, while a cross‑domain evaluation dataset is constructed to validate the model's robustness. Experimental results demonstrate that VisLanding significantly enhances the accuracy of safe zone identification through a depth‑normal joint optimization mechanism, while retaining the zero‑shot generalization advantages of Metric3D V2. The proposed method exhibits superior generalization and robustness in cross‑domain testing compared to other approaches. Furthermore, it enables the estimation of landing zone area by integrating predicted depth and normal information, providing critical decision‑making support for practical applications.

Abstract:
Remote sensing semantic segmentation is crucial for extracting detailed land surface information, enabling applications such as environmental monitoring, land use planning, and resource assessment. In recent years, advancements in artificial intelligence have spurred the development of automatic remote sensing semantic segmentation methods. However, existing semantic segmentation methods focus on distinguishing the spectral characteristics of different objects while ignoring differences in the elevation of different targets. This results in land cover misclassification in complex scenarios involving shadow occlusion and spectral confusion. In this paper, we introduce a depth prompting two‑dimensional (2D) remote sensing semantic segmentation framework (DepthSeg). It automatically models depth/height information from 2D remote sensing images and integrates it into the semantic segmentation framework to mitigate the effects of spectral confusion and shadow occlusion. During the feature extraction phase of DepthSeg, we introduce a lightweight adapter to enable cost‑effective fine‑tuning of the large‑parameter vision transformer encoder pre‑trained on natural images. In the depth prompting phase, we propose a depth prompter to model depth/height features explicitly. In the semantic prediction phase, we introduce a semantic classification decoder that couples the depth prompts with high‑dimensional land‑cover features, enabling accurate extraction of land‑cover types. Experiments on the LiuZhou dataset validate the advantages of the DepthSeg framework in land cover mapping tasks. Detailed ablation studies further highlight the significance of the depth prompts in remote sensing semantic segmentation.

Abstract:
360 video captures the complete surrounding scene with an ultra‑large field of view of 360x180 degrees. This makes 360 scene understanding tasks, e.g., segmentation and tracking, crucial for applications such as autonomous driving and robotics. With the recent emergence of foundation models, the community is, however, impeded by the lack of large‑scale, labeled real‑world datasets. This is caused by the inherent spherical properties, e.g., severe distortion in polar regions and content discontinuities, which make annotation costly and complex. This paper introduces Leader360V, the first large‑scale, labeled real‑world 360 video dataset for instance segmentation and tracking. Our dataset offers high scene diversity, ranging from indoor and urban settings to natural and dynamic outdoor scenes. To automate annotation, we design an automatic labeling pipeline, which subtly coordinates pre‑trained 2D segmentors and large language models to facilitate the labeling. The pipeline operates in three novel stages. Specifically, in the Initial Annotation Phase, we introduce a Semantic‑ and Distortion‑aware Refinement (SDR) module, which combines object mask proposals from multiple 2D segmentors with LLM‑verified semantic labels. These are then converted into mask prompts to guide SAM2 in generating distortion‑aware masks for subsequent frames. In the Auto‑Refine Annotation Phase, missing or incomplete regions are corrected either by applying the SDR again or by resolving the discontinuities near the horizontal borders. The Manual Revision Phase finally incorporates LLMs and human annotators to further refine and validate the annotations. Extensive user studies and evaluations demonstrate the effectiveness of our labeling pipeline. Meanwhile, experiments confirm that Leader360V significantly enhances model performance for 360 video segmentation and tracking, paving the way for more scalable 360 scene understanding.

Abstract:
Vision Language Models (VLMs) provide rich semantic priors but are underexplored in semi‑supervised semantic segmentation. Recent attempts to integrate VLMs to inject high‑level semantics overlook the semantic misalignment between visual and textual representations that arises from using domain‑invariant text embeddings without adapting them to dataset‑ and image‑specific contexts. This lack of domain awareness, coupled with limited annotations, weakens the model's semantic understanding by preventing effective vision‑language alignment. As a result, the model struggles with contextual reasoning, shows weak intra‑class discrimination, and confuses similar classes. To address these challenges, we propose Hierarchical Vision Language transFormer (HVLFormer), which achieves domain‑aware and domain‑robust alignment between visual and textual representations within a mask transformer architecture. First, we transform text embeddings from pretrained VLMs into textual object queries, enabling the generation of multi‑scale, dataset‑aware queries that capture class semantics from coarse to fine granularity and enhance contextual reasoning. Next, we refine these queries by injecting image‑specific visual context to align textual semantics with local scene structures and enhance class discrimination. Finally, to achieve domain robustness, we introduce cross‑view and modal consistency regularization, which enforces prediction consistency within the mask‑transformer architecture across augmented views. Moreover, it ensures stable vision‑language alignment during decoding. With less than 1% training data, HVLFormer outperforms state‑of‑the‑art methods on Pascal VOC, COCO, ADE20K, and Cityscapes. Our code and results will be available on GitHub.

Abstract:
Video Scene Parsing (VSP) has emerged as a cornerstone in computer vision, facilitating the simultaneous segmentation, recognition, and tracking of diverse visual entities in dynamic scenes. In this survey, we present a holistic review of recent advances in VSP, covering a wide array of vision tasks, including Video Semantic Segmentation (VSS), Video Instance Segmentation (VIS), Video Panoptic Segmentation (VPS), as well as Video Tracking and Segmentation (VTS), and Open‑Vocabulary Video Segmentation (OVVS). We systematically analyze the evolution from traditional hand‑crafted features to modern deep learning paradigms ‑‑ spanning from fully convolutional networks to the latest transformer‑based architectures ‑‑ and assess their effectiveness in capturing both local and global temporal contexts. Furthermore, our review critically discusses the technical challenges, ranging from maintaining temporal consistency to handling complex scene dynamics, and offers a comprehensive comparative study of datasets and evaluation metrics that have shaped current benchmarking standards. By distilling the key contributions and shortcomings of state‑of‑the‑art methodologies, this survey highlights emerging trends and prospective research directions that promise to further elevate the robustness and adaptability of VSP in real‑world applications.

Abstract:
Recycling steel scrap can reduce carbon dioxide (CO2) emissions from the steel industry. However, a significant challenge in steel scrap recycling is the inclusion of impurities other than steel. To address this issue, we propose vision‑language‑model‑based anomaly detection where a model is finetuned in a supervised manner, enabling it to handle niche objects effectively. This model enables automated detection of anomalies at a fine‑grained level within steel scrap. Specifically, we finetune the image encoder, equipped with a multi‑scale mechanism and text prompts aligned with both normal and anomaly images. The finetuning process trains these modules using multiclass classification as supervision.

Abstract:
Autonomous vehicles that navigate in open‑world environments may encounter previously unseen object classes. However, most existing LiDAR panoptic segmentation models rely on closed‑set assumptions, failing to detect unknown object instances. In this work, we propose ULOPS, an uncertainty‑guided open‑set panoptic segmentation framework that leverages Dirichlet‑based evidential learning to model predictive uncertainty. Our architecture incorporates separate decoders for semantic segmentation with uncertainty estimation, embedding with prototype association, and instance center prediction. During inference, we leverage uncertainty estimates to identify and segment unknown instances. To strengthen the model's ability to differentiate between known and unknown objects, we introduce three uncertainty‑driven loss functions. Uniform Evidence Loss encourages high uncertainty in unknown regions. Adaptive Uncertainty Separation Loss ensures a consistent difference in uncertainty estimates between known and unknown objects at a global scale. Contrastive Uncertainty Loss refines this separation at the fine‑grained level. To evaluate open‑set performance, we extend benchmark settings on KITTI‑360 and introduce a new open‑set evaluation for nuScenes. Extensive experiments demonstrate that ULOPS consistently outperforms existing open‑set LiDAR panoptic segmentation methods.
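
To make the notion of Dirichlet‑based evidential uncertainty concrete, here is a minimal sketch of the standard formulation from the evidential deep learning literature, in which non‑negative per‑class evidence e_k gives Dirichlet parameters alpha_k = e_k + 1 and an uncertainty mass u = K / sum(alpha). Whether ULOPS uses exactly this form is an assumption; the function names and the 0.5 threshold are hypothetical.

```python
def dirichlet_uncertainty(evidence):
    """Return (expected class probabilities, uncertainty mass) for one point.

    evidence: non-negative per-class evidence values, e.g. the output of a
    softplus or ReLU head (an assumption about the network, not from the paper).
    """
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]      # Dirichlet concentration parameters
    strength = sum(alpha)                    # total Dirichlet strength S
    probs = [a / strength for a in alpha]    # expected class probabilities
    uncertainty = num_classes / strength     # u = K / S, in (0, 1]
    return probs, uncertainty


def flag_unknown(evidence, threshold=0.5):
    """Mark a point as 'unknown' when its uncertainty exceeds a threshold."""
    _, uncertainty = dirichlet_uncertainty(evidence)
    return uncertainty > threshold


# No evidence at all gives maximal uncertainty; strong evidence lowers it.
_, u_none = dirichlet_uncertainty([0.0, 0.0, 0.0])
probs_known, u_known = dirichlet_uncertainty([9.0, 0.0, 0.0])
```

A loss that pushes evidence towards zero in unknown regions would, under this formulation, drive u towards 1 there, which is one way to read "encourages high uncertainty in unknown regions".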

Abstract:
The Deep Learning Visual Space Simulation System (DLVS3) introduces a novel synthetic dataset generator and a simulation pipeline specifically designed for training and testing satellite pose estimation solutions. This work introduces the DLVS3‑HST‑V1 dataset, which focuses on the Hubble Space Telescope (HST) as a complex, articulated target. The dataset is generated using advanced real‑time and offline rendering technologies, integrating high‑fidelity 3D models, dynamic lighting (including secondary sources like Earth reflection), and physically accurate material properties. The pipeline supports the creation of large‑scale, richly annotated image sets with ground‑truth 6‑DoF pose and keypoint data, semantic segmentation, depth, and normal maps. This enables the training and benchmarking of deep learning‑based pose estimation solutions under realistic, diverse, and challenging visual conditions. The paper details the dataset generation process, the simulation architecture, and the integration with deep learning frameworks, and positions DLVS3 as a significant step toward closing the domain gap for autonomous spacecraft operations in proximity and servicing missions.

Abstract:
The segmentation of coal maceral groups can be described as semantic segmentation of coal maceral group images, which is of great significance for studying the chemical properties of coal. Existing semantic segmentation models for coal maceral groups generally achieve higher accuracy by stacking parameters, which increases computational requirements and reduces training efficiency. At the same time, because sampling coal maceral group images requires expertise and the images are diverse, collecting enough training samples takes a long time and requires professional personnel. To address these issues, we develop an IoT‑based DA‑VIT parallel network model. With this model, the dataset can be continuously broadened through IoT, yielding sustained improvement in the accuracy of coal maceral group segmentation. In addition, we decouple the parallel network from the backbone network so that the backbone remains usable during model data updates. Furthermore, the DCSA mechanism of DA‑VIT is introduced to enhance the local feature information of coal microscopic images. DCSA decomposes the large kernels of convolutional attention into multiple scales and reduces the number of parameters by 81.18%. Finally, we perform comparison and ablation experiments between DA‑VIT and state‑of‑the‑art methods on a range of evaluation metrics. Experimental results show that DA‑VIT‑Base achieves 92.14% pixel accuracy and 63.18% mIoU. The parameter count and FLOPs of DA‑VIT‑Tiny are 4.95M and 8.99G, respectively. The proposed DA‑VIT outperforms other state‑of‑the‑art methods on all evaluation metrics.
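
Pixel accuracy and mIoU, the two metrics quoted above and throughout this listing, can be computed from a confusion matrix. The sketch below uses the usual definitions (IoU = TP / (TP + FP + FN), averaged over classes that occur); the function names are illustrative, and skipping absent classes is an assumption, since benchmarks differ on that detail.

```python
def confusion_matrix(truth, pred, num_classes):
    """cm[t][p] counts pixels with ground-truth class t predicted as class p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(truth, pred):
        cm[t][p] += 1
    return cm


def pixel_accuracy(cm):
    """Fraction of pixels whose predicted class equals the ground truth."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def mean_iou(cm):
    """Mean intersection-over-union over classes present in truth or prediction."""
    ious = []
    for c in range(len(cm)):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                    # class c pixels that were missed
        fp = sum(row[c] for row in cm) - tp     # other pixels predicted as c
        union = tp + fp + fn
        if union > 0:                           # skip classes that never occur
            ious.append(tp / union)
    return sum(ious) / len(ious)


# Four flattened pixels, two classes.
cm = confusion_matrix([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
acc = pixel_accuracy(cm)   # 3 of 4 pixels are correct
miou = mean_iou(cm)        # (1/2 + 2/3) / 2
```

The toy example also shows why the two numbers can diverge: accuracy rewards the dominant class, while mIoU weights every class equally.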

Abstract:
We study multi‑modal summarization for instructional videos, whose goal is to provide users an efficient way to learn skills in the form of text instructions and key video frames. We observe that existing benchmarks focus on generic semantic‑level video summarization, and are not suitable for providing step‑by‑step executable instructions and illustrations, both of which are crucial for instructional videos. We propose a novel benchmark for user interface (UI) instructional video summarization to fill the gap. We collect a dataset of 2,413 UI instructional videos, which spans over 167 hours. These videos are manually annotated for video segmentation, text summarization, and video summarization, which enable the comprehensive evaluations for concise and executable video summarization. We conduct extensive experiments on our collected MS4UI dataset, which suggest that state‑of‑the‑art multi‑modal summarization methods struggle on UI video summarization, and highlight the importance of new methods for UI instructional video summarization.

Abstract:
Instance segmentation of ships in synthetic aperture radar (SAR) imagery is critical for applications such as maritime monitoring, environmental analysis, and national security. SAR ship images present challenges including scale variation, object density, and fuzzy target boundaries, which are often overlooked in existing methods, leading to suboptimal performance. In this work, we propose O2Former, a tailored instance segmentation framework that extends Mask2Former by fully leveraging the structural characteristics of SAR imagery. We introduce two key components. The first is the Optimized Query Generator (OQG). It enables multi‑scale feature interaction by jointly encoding shallow positional cues and high‑level semantic information. This improves query quality and convergence efficiency. The second component is the Orientation‑Aware Embedding Module (OAEM). It enhances directional sensitivity through direction‑aware convolution and polar‑coordinate encoding. This effectively addresses the challenge of uneven target orientations in SAR scenes. Together, these modules facilitate precise feature alignment from backbone to decoder and strengthen the model's capacity to capture fine‑grained structural details. Extensive experiments demonstrate that O2Former outperforms state‑of‑the‑art instance segmentation baselines, validating its effectiveness and generalization on SAR ship datasets.

Abstract:
Active Label Correction (ALC) has emerged as a promising solution to the high cost and error‑prone nature of manual pixel‑wise annotation in semantic segmentation, by actively identifying and correcting mislabeled data. Although recent work has improved correction efficiency by generating pseudo‑labels using foundation models, substantial inefficiencies still remain. In this paper, we introduce A^2LC, an Active and Automated Label Correction framework for semantic segmentation, where manual and automatic correction stages operate in a cascaded manner. Specifically, the automatic correction stage leverages human feedback to extend label corrections beyond the queried samples, thereby maximizing cost efficiency. In addition, we introduce an adaptively balanced acquisition function that emphasizes underrepresented tail classes, working in strong synergy with the automatic correction stage. Extensive experiments on Cityscapes and PASCAL VOC 2012 demonstrate that A^2LC significantly outperforms previous state‑of‑the‑art methods. Notably, A^2LC exhibits high efficiency by outperforming previous methods with only 20% of their budget, and shows strong effectiveness by achieving a 27.23% performance gain under the same budget on Cityscapes.

Abstract:
We introduce OV‑MAP, a novel approach to open‑world 3D mapping for mobile robots by integrating open‑features into 3D maps to enhance object recognition capabilities. A significant challenge arises when overlapping features from adjacent voxels reduce instance‑level precision, as features spill over voxel boundaries, blending neighboring regions together. Our method overcomes this by employing a class‑agnostic segmentation model to project 2D masks into 3D space, combined with a supplemented depth image created by merging raw and synthetic depth from point clouds. This approach, along with a 3D mask voting mechanism, enables accurate zero‑shot 3D instance segmentation without relying on 3D supervised segmentation models. We assess the effectiveness of our method through comprehensive experiments on public datasets such as ScanNet200 and Replica, demonstrating superior zero‑shot performance, robustness, and adaptability across diverse environments. Additionally, we conducted real‑world experiments to demonstrate our method's adaptability and robustness when applied to diverse real‑world environments.

Abstract:
Recent advances in deep learning have transformed computer‑assisted intervention and surgical video analysis, driving improvements not only in surgical training, intraoperative decision support, and patient outcomes, but also in postoperative documentation and surgical discovery. Central to these developments is the availability of large, high‑quality annotated datasets. In gynecologic laparoscopy, surgical scene understanding and action recognition are fundamental for building intelligent systems that assist surgeons during operations and provide deeper analysis after surgery. However, existing datasets are often limited by small scale, narrow task focus, or insufficiently detailed annotations, limiting their utility for comprehensive, end‑to‑end workflow analysis. To address these limitations, we introduce GynSurg, the largest and most diverse multi‑task dataset for gynecologic laparoscopic surgery to date. GynSurg provides rich annotations across multiple tasks, supporting applications in action recognition, semantic segmentation, surgical documentation, and discovery of novel procedural insights. We demonstrate the dataset quality and versatility by benchmarking state‑of‑the‑art models under a standardized training protocol. To accelerate progress in the field, we publicly release the GynSurg dataset and its annotations.

Abstract:
Semi‑supervised semantic segmentation (SSSS) faces persistent challenges in effectively leveraging unlabeled data, such as ineffective utilization of pseudo‑labels, exacerbation of class imbalance biases, and neglect of prediction uncertainty. Current approaches often discard uncertain regions through strict thresholding, which favours dominant classes. To address these limitations, we introduce a holistic framework that transforms uncertainty into a learning asset through four principal components: (1) fuzzy pseudo‑labeling, which preserves soft class distributions from top‑K predictions to enrich supervision; (2) uncertainty‑aware dynamic weighting, which modulates pixel‑wise contributions via entropy‑based reliability scores; (3) adaptive class rebalancing, which dynamically adjusts losses to counteract long‑tailed class distributions; and (4) lightweight contrastive regularization, which encourages compact and discriminative feature embeddings. Extensive experiments on benchmarks demonstrate that our method outperforms current state‑of‑the‑art approaches, achieving significant improvements in the segmentation of under‑represented classes and ambiguous regions.
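
The first two components listed above, top‑K fuzzy pseudo‑labels and entropy‑based reliability weights, can be illustrated for a single pixel as follows. This is a rough sketch under stated assumptions rather than the authors' method: renormalising the top‑K probabilities, using one minus the normalised Shannon entropy as the weight, and all function names are choices made here for illustration.

```python
import math


def fuzzy_pseudo_label(probs, k=2):
    """Keep the k largest class probabilities and renormalise them to sum to 1."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return [probs[i] / mass if i in top else 0.0 for i in range(len(probs))]


def entropy_weight(probs):
    """Reliability score in [0, 1]: one minus the normalised Shannon entropy."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return 1.0 - entropy / math.log(len(probs))


def weighted_soft_cross_entropy(student_probs, teacher_probs, k=2):
    """Soft-target cross-entropy for one pixel, scaled by teacher reliability."""
    target = fuzzy_pseudo_label(teacher_probs, k)
    weight = entropy_weight(teacher_probs)
    ce = -sum(t * math.log(max(s, 1e-12))
              for t, s in zip(target, student_probs))
    return weight * ce


# One pixel, three classes: the teacher is fairly unsure between classes 0 and 1.
teacher = [0.6, 0.3, 0.1]
student = [0.5, 0.3, 0.2]
soft_target = fuzzy_pseudo_label(teacher, k=2)
pixel_loss = weighted_soft_cross_entropy(student, teacher, k=2)
```

Unlike hard thresholding, the uncertain pixel still contributes a (down‑weighted) training signal, which is the behaviour the abstract argues for.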

Abstract:
We present a comprehensive study on the classification of iron ore pellets, aimed at identifying quality violations in the final product, alongside the development of an innovative image‑based measurement method utilizing the StarDist algorithm, which is primarily employed in the medical field. This initiative is motivated by the necessity to accurately identify and analyze objects within densely packed and unstable environments. The process involves segmenting these objects, determining their contours, classifying them, and measuring their physical dimensions. This is crucial because the size distribution and classification of pellets, such as distinguishing between nice (quality) and joint (caused by the presence of moisture or indicating a production process failure) types, are among the most significant characteristics that define the quality of the final product. Traditional algorithms, including image classification techniques using Vision Transformer (ViT), instance segmentation methods like Mask R‑CNN, and various anomaly segmentation algorithms, have not yielded satisfactory results in this context. Consequently, we explored methodologies from related fields to enhance our approach. The outcome of our research is a novel method designed to detect objects with smoothed boundaries. This advancement significantly improves the accuracy of physical dimension measurements and facilitates a more precise analysis of size distribution among the iron ore pellets. By leveraging the strengths of the StarDist algorithm, we aim to provide a robust solution that addresses the challenges posed by the complex nature of pellet classification and measurement.

Abstract:
Spatial Semantic Segmentation of Sound Scenes (S5) aims to enhance technologies for sound event detection and separation from multi‑channel input signals that mix multiple sound events with spatial information. This is a fundamental basis of immersive communication. The ultimate goal is to separate sound event signals with 6 Degrees of Freedom (6DoF) information into dry sound object signals and metadata about the object type (sound event class) and representing spatial information, including direction. However, because several existing challenge tasks already provide some of the subset functions, this task for this year focuses on detecting and separating sound events from multi‑channel spatial input signals. This paper outlines the S5 task setting of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2025 Challenge Task 4 and the DCASE2025 Task 4 Dataset, newly recorded and curated for this task. We also report experimental results for an S5 system trained and evaluated on this dataset. The full version of this paper will be published after the challenge results are made public.

Abstract:
Flow Matching has emerged as a powerful framework for learning continuous transformations between distributions, enabling high‑fidelity generative modeling. This work introduces Symmetrical Flow Matching (SymmFlow), a new formulation that unifies semantic segmentation, classification, and image generation within a single model. Using a symmetric learning objective, SymmFlow models forward and reverse transformations jointly, ensuring bi‑directional consistency, while preserving sufficient entropy for generative diversity. A new training objective is introduced to explicitly retain semantic information across flows, featuring efficient sampling while preserving semantic structure, allowing for one‑step segmentation and classification without iterative refinement. Unlike previous approaches that impose strict one‑to‑one mapping between masks and images, SymmFlow generalizes to flexible conditioning, supporting both pixel‑level and image‑level class labels. Experimental results on various benchmarks demonstrate that SymmFlow achieves state‑of‑the‑art performance on semantic image synthesis, obtaining FID scores of 11.9 on CelebAMask‑HQ and 7.0 on COCO‑Stuff with only 25 inference steps. Additionally, it delivers competitive results on semantic segmentation and shows promising capabilities in classification tasks.
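
For readers unfamiliar with flow matching, the sketch below shows the widely used linear‑path variant: a point on the path is x_t = (1 - t) x0 + t x1, the regression target for the velocity field is x1 - x0, and sampling integrates the learned velocity with Euler steps. This is generic background, not SymmFlow's symmetric objective, whose exact form the abstract does not give; the function names and the oracle velocity are hypothetical.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from x0 (t = 0) to x1 (t = 1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path; constant in t for the linear variant."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predicted_velocity, x0, x1):
    """Mean squared error between a predicted velocity and the path velocity."""
    target = target_velocity(x0, x1)
    return sum((p - v) ** 2
               for p, v in zip(predicted_velocity, target)) / len(target)


def euler_integrate(velocity_fn, x0, steps):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [a + dt * b for a, b in zip(x, v)]
    return x


# With a perfect (oracle) velocity field, 1 step and 25 steps reach the same end.
x0, x1 = [0.0, 1.0], [2.0, 0.0]
oracle = lambda x, t: target_velocity(x0, x1)
one_step = euler_integrate(oracle, x0, steps=1)
many_steps = euler_integrate(oracle, x0, steps=25)
```

The toy case hints at why straight paths make few‑step or one‑step inference plausible: when the velocity is constant along the path, a single Euler step already lands on the endpoint.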

Abstract:
This paper introduces ALBERT, an instance segmentation model specifically designed for comprehensive car damage and part segmentation. Leveraging the power of Bidirectional Encoder Representations, ALBERT incorporates advanced localization mechanisms to accurately identify and differentiate between real and fake damages, as well as segment individual car parts. The model is trained on a large‑scale, richly annotated automotive dataset that categorizes damage into 26 types, identifies 7 fake damage variants, and segments 61 distinct car parts. Our approach demonstrates strong performance in both segmentation accuracy and damage classification, paving the way for intelligent automotive inspection and assessment applications.

Abstract:
The Reference Remote Sensing Image Segmentation (RRSIS) task generates segmentation masks for specified objects in images based on textual descriptions, and it has attracted widespread attention and research interest. Current RRSIS methods rely on multi‑modal fusion backbones and semantic segmentation heads but face challenges such as dense annotation requirements and complex scene interpretation. To address these issues, we propose a framework named prompt‑generated semantic localization guiding Segment Anything Model (PSLG‑SAM), which decomposes the RRSIS task into two stages: coarse localization and fine segmentation. In the coarse localization stage, a visual grounding network roughly locates the text‑described object. In the fine segmentation stage, the coordinates from the first stage guide the Segment Anything Model (SAM), enhanced by a clustering‑based foreground point generator and a mask boundary iterative optimization strategy for precise segmentation. Notably, the second stage can be training‑free, significantly reducing the annotation burden for the RRSIS task. Additionally, decomposing the RRSIS task into two stages allows the model to focus on segmenting a specific region, avoiding interference from complex scenes. We further contribute a high‑quality, multi‑category manually annotated dataset. Experimental validation on two datasets (RRSIS‑D and RRSIS‑M) demonstrates that PSLG‑SAM achieves significant performance improvements and surpasses existing state‑of‑the‑art models. Our code will be made publicly available.

Abstract:
This work demonstrates how autonomously learning aspects of robotic operation from sparsely‑labeled, real‑world data of deployed, engineered solutions at industrial scale can provide solutions that achieve improved performance. Specifically, it focuses on multi‑suction robot picking and performs a comprehensive study on the application of multi‑modal visual encoders for predicting the success of candidate robotic picks. Picking diverse items from unstructured piles is an important and challenging task for robot manipulation in real‑world settings, such as warehouses. Methods for picking from clutter must work for an open set of items while simultaneously meeting latency constraints to achieve high throughput. The demonstrated approach utilizes multiple input modalities, such as RGB, depth and semantic segmentation, to estimate the quality of candidate multi‑suction picks. The strategy is trained from real‑world item picking data, with a combination of multimodal pretraining and finetuning. The manuscript provides comprehensive experimental evaluation performed over a large item‑picking dataset, an item‑picking dataset targeted to include partial occlusions, and a package‑picking dataset, which focuses on containers, such as boxes and envelopes, instead of unpackaged items. The evaluation measures performance for different item configurations, pick scenes, and object types. Ablations help to understand the effects of in‑domain pretraining, the impact of different modalities and the importance of finetuning. These ablations reveal both the importance of training over multiple modalities and the ability of models to learn, during pretraining, the relationship between modalities, so that during finetuning and inference only a subset of them can be used as input.

Abstract:
Estimating the 6D pose of objects from RGBD data is a fundamental problem in computer vision, with applications in robotics and augmented reality. A key challenge is achieving generalization to novel objects that were not seen during training. Most existing approaches address this by scaling up training on synthetic data tailored to the task, a process that demands substantial computational resources. But is task‑specific training really necessary for accurate and efficient 6D pose estimation of novel objects? To answer No!, we introduce FreeZeV2, the second generation of FreeZe: a training‑free method that achieves strong generalization to unseen objects by leveraging geometric and vision foundation models pre‑trained on unrelated data. FreeZeV2 improves both accuracy and efficiency over FreeZe through three key contributions: (i) a sparse feature extraction strategy that reduces inference‑time computation without sacrificing accuracy; (ii) a feature‑aware scoring mechanism that improves both pose selection during RANSAC‑based 3D registration and the final ranking of pose candidates; and (iii) a modular design that supports ensembles of instance segmentation models, increasing robustness to segmentation mask errors. We evaluate FreeZeV2 on the seven core datasets of the BOP Benchmark, where it establishes a new state of the art in 6D pose estimation of unseen objects. When using the same segmentation masks, FreeZeV2 achieves a remarkable 8x speedup over FreeZe while also improving accuracy by 5%. When using ensembles of segmentation models, FreeZeV2 gains an additional 8% in accuracy while still running 2.5x faster than FreeZe. FreeZeV2 was awarded Best Overall Method at the BOP Challenge 2024.

Abstract:
The Segment Anything Model 2 (SAM2) is a powerful foundation model for promptable segmentation. However, its high computational and memory costs are a major barrier to deployment on resource‑constrained devices. In this paper, we present Q‑SAM2, an accurate low‑bit quantization method that achieves high compression and high fidelity. To address performance degradation arising from challenging weight and activation distributions during quantization, Q‑SAM2 introduces two novel contributions: Variance‑Reduced Calibration (VRC), an initialization method that reduces weight statistical variance by minimizing the Frobenius norm over a small calibration batch; and Learnable Statistical Clipping (LSC), a Quantization‑Aware Training (QAT) method that learns momentum‑stabilized clipping factors to manage outliers in weights and activations. Comprehensive experiments demonstrate that Q‑SAM2 achieves highly accurate inference with substantial efficiency gains, significantly surpassing state‑of‑the‑art general QAT schemes, particularly in the ultra‑low 2‑bit regime. Specifically, Q‑SAM2 achieves an accuracy gain of up to 9.7 ppt in J&F on the video segmentation benchmark and 7.3 ppt in mIoU for instance segmentation over the best competing QAT model, all while achieving an 8x reduction in model size compared to the BF16 baseline.
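The role of a clipping factor in low‑bit quantization can be illustrated with a minimal sketch. This is not the Q‑SAM2 method: the function name `fake_quantize` is hypothetical, the clipping threshold is a plain argument rather than a learned, momentum‑stabilized parameter, and a symmetric uniform quantizer is assumed.

```python
def fake_quantize(values, clip, bits=2):
    """Symmetric uniform quantize-dequantize with values clipped to [-clip, clip].

    Illustrative only: `clip` is fixed here, whereas a QAT method would learn it.
    """
    if clip <= 0 or bits < 2:
        raise ValueError("clip must be positive and bits must be at least 2")
    qmax = 2 ** (bits - 1) - 1      # 1 for 2-bit (levels -1, 0, 1), 127 for 8-bit
    scale = clip / qmax             # step size between adjacent levels
    out = []
    for v in values:
        clipped = max(-clip, min(clip, v))   # saturate outliers at the threshold
        level = round(clipped / scale)       # integer level in [-qmax, qmax]
        out.append(level * scale)
    return out


# One outlier (5.0) among small weights.
weights = [0.05, -0.2, 0.15, 5.0]
wide = fake_quantize(weights, clip=5.0)    # range set by the outlier
tight = fake_quantize(weights, clip=0.2)   # outlier saturated instead
```

With the wide range, every small weight collapses to zero and only the outlier survives; with the tight range, the small weights keep their sign and rough magnitude while the outlier saturates. This trade‑off is what a learned clipping factor is meant to balance.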

Abstract:
The robust interpretation of 3D environments is crucial for human‑robot collaboration (HRC) applications, where safety and operational efficiency are paramount. Semantic segmentation plays a key role in this context by enabling a precise and detailed understanding of the environment. Because effective semantic segmentation demands large amounts of annotated real‑world industrial data, this paper introduces a pioneering approach to Sim2Real domain adaptation for semantic segmentation of 3D point cloud data, specifically tailored for HRC. Our focus is on developing a network that robustly transitions from simulated environments to real‑world applications, thereby enhancing its practical utility and impact on safe HRC. In this work, we propose a dual‑stream network architecture (FUSION) combining Dynamic Graph Convolutional Neural Networks (DGCNN) and Convolutional Neural Networks (CNN) augmented with residual layers as a Sim2Real domain adaptation algorithm for an industrial environment. The proposed model was evaluated on real‑world HRC setups and simulated industrial point clouds. It showed state‑of‑the‑art performance, achieving a segmentation accuracy of 97.76%, and superior robustness compared to existing methods.

Abstract:
Incompletely‑Supervised Concealed Object Segmentation (ISCOS) involves segmenting objects that seamlessly blend into their surrounding environments, utilizing incompletely annotated data, such as weak and semi‑annotations, for model training. This task remains highly challenging due to (1) the limited supervision provided by the incompletely annotated training data, and (2) the difficulty of distinguishing concealed objects from the background, which arises from the intrinsic similarities in concealed scenarios. In this paper, we introduce the first unified method for ISCOS to address these challenges. To tackle the issue of incomplete supervision, we propose a unified mean‑teacher framework, SEE, that leverages the vision foundation model, "Segment Anything Model (SAM)", to generate pseudo‑labels using coarse masks produced by the teacher model as prompts. To mitigate the effect of low‑quality segmentation masks, we introduce a series of strategies for pseudo‑label generation, storage, and supervision. These strategies aim to produce informative pseudo‑labels, store the best pseudo‑labels generated, and select the most reliable components to guide the student model, thereby ensuring robust network training. Additionally, to tackle the issue of intrinsic similarity, we design a hybrid‑granularity feature grouping module that groups features at different granularities and aggregates these results. By clustering similar features, this module promotes segmentation coherence, facilitating more complete segmentation for both single‑object and multiple‑object images. We validate the effectiveness of our approach across multiple ISCOS tasks, and experimental results demonstrate that our method achieves state‑of‑the‑art performance. Furthermore, SEE can serve as a plug‑and‑play solution, enhancing the performance of existing models.

Abstract:
Semantic segmentation in remote sensing images is crucial for various applications, yet its performance is heavily reliant on large‑scale, high‑quality pixel‑wise annotations, which are notoriously expensive and time‑consuming to acquire. Semi‑supervised semantic segmentation (SSS) offers a promising alternative to mitigate this data dependency. However, existing SSS methods often struggle with the inherent distribution mismatch between limited labeled data and abundant unlabeled data, leading to suboptimal generalization. To alleviate this issue, we bring Vision Foundation Models (VFMs) pre‑trained on vast and diverse datasets into the SSS task, since VFMs possess robust generalization capabilities that can effectively bridge this distribution gap and provide strong semantic priors for SSS. Building on this idea, we introduce RS‑MTDF (Multi‑Teacher Distillation and Fusion), a novel framework that leverages the powerful semantic knowledge embedded in VFMs to guide semi‑supervised learning in remote sensing. Specifically, RS‑MTDF employs multiple frozen VFMs (e.g., DINOv2 and CLIP) as expert teachers, utilizing feature‑level distillation to align student features with their robust representations. To further enhance discriminative power, the distilled knowledge is seamlessly fused into the student decoder. Extensive experiments on three challenging remote sensing datasets demonstrate that RS‑MTDF consistently achieves state‑of‑the‑art performance. Notably, our method outperforms existing approaches across various label ratios on LoveDA and secures the highest IoU in the majority of semantic categories. These results underscore the efficacy of multi‑teacher VFM guidance in significantly enhancing both generalization and semantic understanding for remote sensing segmentation. Ablation studies further validate the contribution of each proposed module.

Abstract:
Vision‑language models such as CLIP have recently propelled open‑vocabulary dense prediction tasks by enabling recognition of a broad range of visual concepts. However, CLIP still struggles with fine‑grained, region‑level understanding, hindering its effectiveness on these dense prediction tasks. We identify two pivotal factors required to address this limitation: semantic coherence and fine‑grained vision‑language alignment. Current adaptation methods often improve fine‑grained alignment at the expense of semantic coherence, and often rely on extra modules or supervised fine‑tuning. To overcome these issues, we propose Any‑to‑Any Self‑Distillation (ATAS), a novel approach that simultaneously enhances semantic coherence and fine‑grained alignment by leveraging a model's own knowledge across all representation levels. Unlike prior methods, ATAS uses only unlabeled images and an internal self‑distillation process to refine representations of CLIP vision encoders, preserving local semantic consistency while sharpening local detail recognition. On open‑vocabulary object detection and semantic segmentation benchmarks, ATAS achieves substantial performance gains, outperforming baseline CLIP models. These results validate the effectiveness of our approach and underscore the importance of jointly maintaining semantic coherence and fine‑grained alignment for advanced open‑vocabulary dense prediction.

Abstract:
In the past decade, Convolutional Neural Networks (CNNs) and Transformers have achieved wide application in semantic segmentation tasks. Although combining CNNs with Transformer models greatly improves performance, global context modeling remains inadequate. Recently, Mamba has shown great potential in vision tasks, demonstrating its advantages in modeling long‑range dependencies. In this paper, we propose a lightweight Efficient CNN‑Mamba Network for semantic segmentation, dubbed ECMNet. ECMNet skillfully combines CNN with Mamba in a capsule‑based framework to address their complementary weaknesses. Specifically, we design an Enhanced Dual‑Attention Block (EDAB) for a lightweight bottleneck. To improve the representational ability of features, we devise a Multi‑Scale Attention Unit (MSAU) that integrates multi‑scale feature aggregation, spatial aggregation and channel aggregation. Moreover, a Mamba‑enhanced Feature Fusion Module (FFM) merges features from diverse levels, significantly enhancing segmentation accuracy. Extensive experiments on two representative datasets demonstrate that the proposed model achieves a strong balance between accuracy and efficiency, reaching 70.6% mIoU on Cityscapes and 73.6% mIoU on CamVid test datasets, with 0.87M parameters and 8.27G FLOPs on a single RTX 3090 GPU platform.

Abstract:
Accurate segmentation of anatomical structures in the apical four‑chamber (A4C) view of fetal echocardiography is essential for early diagnosis and prenatal evaluation of congenital heart disease (CHD). However, precise segmentation remains challenging due to ultrasound artifacts, speckle noise, anatomical variability, and boundary ambiguity across different gestational stages. To reduce the workload of sonographers and enhance segmentation accuracy, we propose DCD, an advanced deep learning‑based model for automatic segmentation of key anatomical structures in the fetal A4C view. Our model incorporates a Dense Atrous Spatial Pyramid Pooling (Dense ASPP) module, enabling superior multi‑scale feature extraction, and a Convolutional Block Attention Module (CBAM) to enhance adaptive feature representation. By effectively capturing both local and global contextual information, DCD achieves precise and robust segmentation, contributing to improved prenatal cardiac assessment.

Abstract:
Accurate canal network mapping is essential for water management, including irrigation planning and infrastructure maintenance. State‑of‑the‑art semantic segmentation models for infrastructure mapping, such as roads, rely on large, well‑annotated remote sensing datasets. However, incomplete or inadequate ground truth can hinder these learning approaches. Many infrastructure networks have graph‑level properties, such as reachability to a source (like canals) or connectivity (roads), that can be leveraged to improve the existing ground truth. This paper develops a novel iterative framework, IGraSS, that combines a semantic segmentation module, which incorporates RGB and additional modalities (NDWI, DEM), with a graph‑based ground‑truth refinement module. The segmentation module processes satellite imagery patches, while the refinement module operates on the entire data, viewing the infrastructure network as a graph. Experiments show that IGraSS reduces unreachable canal segments from around 18% to 3%, and training with refined ground truth significantly improves canal identification. IGraSS serves as a robust framework for both refining noisy ground truth and mapping canal networks from remote sensing imagery. We also demonstrate the effectiveness and generalizability of IGraSS using road networks as an example, applying a different graph‑theoretic constraint to complete road networks.
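The graph‑level property mentioned above, reachability to a source, can be checked with a breadth‑first search. The following is a minimal sketch under stated assumptions, not the IGraSS refinement module: the adjacency‑dictionary representation, the function name `unreachable_segments`, and the toy network are all hypothetical.

```python
from collections import deque


def unreachable_segments(adjacency, sources):
    """Return the set of nodes that cannot be reached from any source node.

    `adjacency` maps each node to a list of its neighbors (assumed undirected,
    with every node present as a key).
    """
    seen = set(sources)
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return set(adjacency) - seen


# Toy canal network: segments "d" and "e" are cut off from the source "a".
canals = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "d": ["e"], "e": ["d"]}
cut_off = unreachable_segments(canals, sources=["a"])
```

Segments flagged this way are candidates for ground‑truth repair, since a canal that carries water should connect back to some source.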

Abstract:
Vision‑Language Models (VLMs) lag behind Large Language Models due to the scarcity of annotated datasets, as creating paired visual‑textual annotations is labor‑intensive and expensive. To address this bottleneck, we introduce SAM2Auto, the first fully automated annotation pipeline for video datasets requiring no human intervention or dataset‑specific training. Our approach consists of two key components: SMART‑OD, a robust object detection system that combines automatic mask generation with open‑world object detection capabilities, and FLASH (Frame‑Level Annotation and Segmentation Handler), a multi‑object real‑time video instance segmentation (VIS) module that maintains consistent object identification across video frames even with intermittent detection gaps. Unlike existing open‑world detection methods that require frame‑specific hyperparameter tuning and suffer from numerous false positives, our system employs statistical approaches to minimize detection errors while ensuring consistent object tracking throughout entire video sequences. Extensive experimental validation demonstrates that SAM2Auto achieves accuracy comparable to manual annotation while dramatically reducing annotation time and eliminating labor costs. The system successfully handles diverse datasets without requiring retraining or extensive parameter adjustments, making it a practical solution for large‑scale dataset creation. Our work establishes a new baseline for automated video annotation and provides a pathway for accelerating VLM development by addressing the fundamental dataset bottleneck that has constrained progress in vision‑language understanding.

Abstract:
Semantic segmentation of ultra‑high‑resolution (UHR) remote sensing imagery is critical for applications like environmental monitoring and urban planning but faces computational and optimization challenges. Conventional methods either lose fine details through downsampling or fragment global context via patch processing. While multi‑branch networks address this trade‑off, they suffer from computational inefficiency and conflicting gradient dynamics during training. We propose F2Net, a frequency‑aware framework that decomposes UHR images into high‑ and low‑frequency components for specialized processing. The high‑frequency branch preserves full‑resolution structural details, while the low‑frequency branch processes downsampled inputs through dual sub‑branches capturing short‑ and long‑range dependencies. A Hybrid‑Frequency Fusion module integrates these observations, guided by two novel objectives: Cross‑Frequency Alignment Loss ensures semantic consistency between frequency components, and Cross‑Frequency Balance Loss regulates gradient magnitudes across branches to stabilize training. Evaluated on DeepGlobe and Inria Aerial benchmarks, F2Net achieves state‑of‑the‑art performance with mIoU of 80.22 and 83.39, respectively. Our code will be publicly available.

Abstract:
3D Gaussian Splatting (3DGS) has emerged as a powerful representation for neural scene reconstruction, offering high‑quality novel view synthesis while maintaining computational efficiency. In this paper, we extend the capabilities of 3DGS beyond pure scene representation by introducing an approach for open‑vocabulary 3D instance segmentation without requiring manual labeling, termed OpenSplat3D. Our method leverages feature‑splatting techniques to associate semantic information with individual Gaussians, enabling fine‑grained scene understanding. We incorporate Segment Anything Model instance masks with a contrastive loss formulation as guidance for the instance features to achieve accurate instance‑level segmentation. Furthermore, we utilize language embeddings of a vision‑language model, allowing for flexible, text‑driven instance identification. This combination enables our system to identify and segment arbitrary objects in 3D scenes based on natural language descriptions. We show results on LERF‑mask and LERF‑OVS as well as the full ScanNet++ validation set, demonstrating the effectiveness of our approach.

Abstract:
3D Gaussian Splatting has achieved remarkable success in reconstructing both static and dynamic 3D scenes. However, in a scene represented by 3D Gaussian primitives, interactions between objects suffer from inaccurate 3D segmentation, imprecise deformation among different materials, and severe rendering artifacts. To address these challenges, we introduce PIG: Physically‑Based Multi‑Material Interaction with 3D Gaussians, a novel approach that combines 3D object segmentation with the simulation of interacting objects in high precision. Firstly, our method facilitates fast and accurate mapping from 2D pixels to 3D Gaussians, enabling precise 3D object‑level segmentation. Secondly, we assign unique physical properties to correspondingly segmented objects within the scene for multi‑material coupled interactions. Finally, we have successfully embedded constraint scales into deformation gradients, specifically clamping the scaling and rotation properties of the Gaussian primitives to eliminate artifacts and achieve geometric fidelity and visual consistency. Experimental results demonstrate that our method not only outperforms the state‑of‑the‑art (SOTA) in terms of visual quality, but also opens up new directions and pipelines for the field of physically realistic scene generation.

Abstract:
Accurate ultrasound image segmentation is a prerequisite for precise biometrics and accurate assessment. Relying on manual delineation introduces significant errors and is time‑consuming. However, existing segmentation models are designed based on objects in natural scenes, making them difficult to adapt to ultrasound objects with high noise and high similarity. This is particularly evident in small object segmentation, where a pronounced jagged effect occurs. Therefore, this paper proposes a fetal femur and cranial ultrasound image segmentation model based on feature perception and Mamba enhancement to address these challenges. Specifically, a longitudinal and transverse independent viewpoint scanning convolution block and a feature perception module were designed to enhance the ability to capture local detail information and improve the fusion of contextual information. Combined with the Mamba‑optimized residual structure, this design suppresses the interference of raw noise and enhances local multi‑dimensional scanning. The system builds global information and local feature dependencies, and is trained with a combination of different optimizers to achieve the optimal solution. After extensive experimental validation, the FAMSeg network achieved the fastest loss reduction and the best segmentation performance across images of varying sizes and orientations.

Abstract:
Cross‑domain few‑shot segmentation (CD‑FSS) is proposed to pre‑train the model on a source‑domain dataset with sufficient samples, and then transfer the model to target‑domain datasets where only a few samples are available for efficient fine‑tuning. There are two major challenges in this task: (1) the domain gap and (2) fine‑tuning with scarce data. To solve these challenges, we revisit the adapter‑based methods, and discover an intriguing insight not explored in previous works: the adapter not only helps the fine‑tuning of downstream tasks but also naturally serves as a domain information decoupler. Then, we delve into this finding for an interpretation, and find that the model's inherent structure could lead to a natural decoupling of domain information. Building upon this insight, we propose the Domain Feature Navigator (DFN), a structure‑based decoupler rather than a loss‑based one as in current works, to capture domain‑specific information, thereby directing the model's attention towards domain‑agnostic knowledge. Moreover, to prevent excessive overfitting of DFN during the source‑domain training, we further design the SAM‑SVN method to constrain DFN from learning sample‑specific knowledge. On target domains, we freeze the model and fine‑tune the DFN to learn target‑specific knowledge. Extensive experiments demonstrate that our method surpasses the state‑of‑the‑art method in CD‑FSS significantly by 2.69% and 4.68% MIoU in 1‑shot and 5‑shot scenarios, respectively.

Abstract:
Retrieval‑Augmented Generation (RAG) systems require corpora that are both structurally clean and semantically coherent. BRIGHT is a recent and influential benchmark designed to evaluate complex multi‑hop retrieval across diverse, high‑reasoning domains. However, its practical effectiveness is limited by common web‑crawled artifacts ‑ such as content redundancy and semantic discontinuity ‑ that impair retrieval accuracy and downstream reasoning. Notably, we find that such issues are concentrated in seven StackExchange‑derived subdomains, while other domains (e.g., Coding and Theorem‑based content) remain relatively clean. In this study, we present MARCUS, a multi‑agent pipeline that leverages large language models (LLMs) to systematically clean and re‑chunk BRIGHT into a higher‑quality corpus: BRIGHT‑Plus. MARCUS applies dedicated agents for structural noise removal and semantic segmentation, preserving answer‑bearing spans while improving contextual integrity. Experimental evaluations demonstrate that BRIGHT‑Plus yields consistent and significant improvements in both retrieval accuracy and multi‑hop reasoning across a diverse set of retrievers. We release both the BRIGHT‑Plus corpus and the MARCUS pipeline to support future research on robust, reasoning‑centric retrieval.

Abstract:
This technical report presents the implementation details of the winning solution for the ICRA 2025 GOOSE 3D Semantic Segmentation Challenge. This challenge focuses on semantic segmentation of 3D point clouds from diverse unstructured outdoor environments collected from multiple robotic platforms. This problem was addressed by implementing Point Prompt Tuning (PPT) integrated with Point Transformer v3 (PTv3) backbone, enabling adaptive processing of heterogeneous LiDAR data through platform‑specific conditioning and cross‑dataset class alignment strategies. The model is trained without requiring additional external data. As a result, this approach achieved substantial performance improvements with mIoU increases of up to 22.59% on challenging platforms compared to the baseline PTv3 model, demonstrating the effectiveness of adaptive point cloud understanding for field robotics applications.

Abstract:
Semantic segmentation of satellite imagery is crucial for Earth observation applications, but remains constrained by limited labelled training data. While self‑supervised pretraining methods like Masked Autoencoders (MAE) have shown promise, they focus on reconstruction rather than localisation, a fundamental aspect of segmentation tasks. We propose adapting LOCA (Location‑aware), a position prediction self‑supervised learning method, for multimodal satellite imagery semantic segmentation. Our approach addresses the unique challenges of satellite data by extending SatMAE's channel grouping from multispectral to multimodal data, enabling effective handling of multiple modalities, and introducing same‑group attention masking to encourage cross‑modal interaction during pretraining. The method uses relative patch position prediction, encouraging spatial reasoning for localisation rather than reconstruction. We evaluate our approach on the Sen1Floods11 flood mapping dataset, where it significantly outperforms existing reconstruction‑based self‑supervised learning methods for satellite imagery. Our results demonstrate that position prediction tasks, when properly adapted for multimodal satellite imagery, learn representations more effective for satellite image semantic segmentation than reconstruction‑based approaches.

Abstract:
Endoscopic surgery is the gold standard for robotic‑assisted minimally invasive surgery, offering significant advantages in early disease detection and precise interventions. However, the complexity of surgical scenes, characterized by high variability in different surgical activity scenarios and confused image features between targets and the background, presents challenges for surgical environment understanding. Traditional deep learning models often struggle with cross‑activity interference, leading to suboptimal performance in each downstream task. To address this limitation, we explore multi‑task learning, which utilizes the interrelated features between tasks to enhance overall task performance. In this paper, we propose EndoARSS, a novel multi‑task learning framework specifically designed for endoscopy surgery activity recognition and semantic segmentation. Built upon the DINOv2 foundation model, our approach integrates Low‑Rank Adaptation to facilitate efficient fine‑tuning while incorporating Task Efficient Shared Low‑Rank Adapters to mitigate gradient conflicts across diverse tasks. Additionally, we introduce the Spatially‑Aware Multi‑Scale Attention that enhances feature representation discrimination by enabling cross‑spatial learning of global information. In order to evaluate the effectiveness of our framework, we present three novel datasets, MTLESD, MTLEndovis and MTLEndovis‑Gen, tailored for endoscopic surgery scenarios with detailed annotations for both activity recognition and semantic segmentation tasks. Extensive experiments demonstrate that EndoARSS achieves remarkable performance across multiple benchmarks, significantly improving both accuracy and robustness in comparison to existing models. These results underscore the potential of EndoARSS to advance AI‑driven endoscopic surgical systems, offering valuable insights for enhancing surgical safety and efficiency.

Abstract:
Discovering novel classes in open‑world settings is crucial for real‑world applications. Traditional explicit representations, such as object descriptors or 3D segmentation maps, are constrained by their discrete, hole‑prone, and noisy nature, which hinders accurate novel class discovery. To address these challenges, we introduce NeurNCD, the first versatile and data‑efficient framework for novel class discovery that employs the meticulously designed Embedding‑NeRF model combined with KL divergence as a substitute for traditional explicit 3D segmentation maps to aggregate semantic embedding and entropy in visual embedding space. NeurNCD also integrates several key components, including feature query, feature modulation and clustering, facilitating efficient feature augmentation and information exchange between the pre‑trained semantic segmentation network and implicit neural representations. As a result, our framework achieves superior segmentation performance in both open and closed‑world settings without relying on densely labelled datasets for supervised training or human interaction to generate sparse label supervision. Extensive experiments demonstrate that our method significantly outperforms state‑of‑the‑art approaches on the NYUv2 and Replica datasets.

Abstract:
Semantic segmentation is critical for scene understanding but demands costly pixel‑wise annotations, attracting increasing attention to semi‑supervised approaches to leverage abundant unlabeled data. While semi‑supervised segmentation is often promoted as a path toward scalable, real‑world deployment, it is astonishing that current evaluation protocols exclusively focus on segmentation accuracy, entirely overlooking reliability and robustness. These qualities, which ensure consistent performance under diverse conditions (robustness) and well‑calibrated model confidences as well as meaningful uncertainties (reliability), are essential for safety‑critical applications like autonomous driving, where models must handle unpredictable environments and avoid sudden failures at all costs. To address this gap, we introduce the Reliable Segmentation Score (RSS), a novel metric that combines predictive accuracy, calibration, and uncertainty quality measures via a harmonic mean. RSS penalizes deficiencies in any of its components, providing an easy and intuitive way of holistically judging segmentation models. Comprehensive evaluations of UniMatchV2 against its predecessor and a supervised baseline show that semi‑supervised methods often trade reliability for accuracy. While out‑of‑domain evaluations demonstrate UniMatchV2's robustness, they further expose persistent reliability shortcomings. We advocate for a shift in evaluation protocols toward more holistic metrics like RSS to better align semi‑supervised learning research with real‑world deployment needs.
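The effect of combining components through a harmonic mean can be seen in a short sketch. The abstract does not define the individual component measures, so this example simply assumes three scores already normalized to [0, 1] with higher being better; the function name `reliable_segmentation_score` is hypothetical.

```python
from statistics import harmonic_mean


def reliable_segmentation_score(accuracy, calibration, uncertainty):
    """Harmonic mean of three component scores, each assumed to lie in [0, 1]."""
    components = [accuracy, calibration, uncertainty]
    if any(not 0.0 <= c <= 1.0 for c in components):
        raise ValueError("component scores must lie in [0, 1]")
    return harmonic_mean(components)


balanced = reliable_segmentation_score(0.8, 0.8, 0.8)
lopsided = reliable_segmentation_score(0.9, 0.9, 0.1)
```

The lopsided model has an arithmetic mean of about 0.63, yet its harmonic mean falls below 0.25, which is how a score of this form penalizes a deficiency in any single component.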

Abstract:
This paper addresses the problem of category‑level pose estimation for articulated objects in robotic manipulation tasks. Recent works have shown promising results in estimating part pose and size at the category level. However, these approaches primarily follow a complex multi‑stage pipeline that first segments part instances in the point cloud and then estimates the Normalized Part Coordinate Space (NPCS) representation for 6D poses. These approaches suffer from high computational costs and low performance in real‑time robotic tasks. To address these limitations, we propose YOEO, a single‑stage method that simultaneously outputs instance segmentation and NPCS representations in an end‑to‑end manner. We use a unified network to generate point‑wise semantic labels and centroid offsets, allowing points from the same part instance to vote for the same centroid. We further utilize a clustering algorithm to distinguish points based on their estimated centroid distances. Finally, we separate the NPCS region of each instance and align the separated regions with the real point cloud to recover the final pose and size. Experimental results on the GAPart dataset demonstrate the pose estimation capabilities of our proposed single‑shot method. We also deploy our synthetically‑trained model in a real‑world setting, providing real‑time visual feedback at 200Hz, enabling a physical Kinova robot to interact with unseen articulated objects. This showcases the utility and effectiveness of our proposed method.
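The centroid‑voting step can be sketched as follows. This is an illustration under stated assumptions rather than the YOEO implementation: the offsets are given directly instead of being predicted by a network, the function name `group_by_votes` is hypothetical, and a simple greedy distance‑threshold clustering stands in for whichever clustering algorithm the authors use.

```python
import math


def group_by_votes(points, offsets, radius):
    """Greedily cluster points whose voted centroids (point + offset) lie within `radius`.

    Returns one integer instance label per point.
    """
    centers = []   # running mean of the votes assigned to each cluster
    counts = []    # number of votes per cluster
    labels = []
    for point, offset in zip(points, offsets):
        vote = tuple(p + o for p, o in zip(point, offset))
        for idx, center in enumerate(centers):
            if math.dist(vote, center) <= radius:
                n = counts[idx]
                # Update the running mean with the new vote.
                centers[idx] = tuple((c * n + v) / (n + 1) for c, v in zip(center, vote))
                counts[idx] = n + 1
                labels.append(idx)
                break
        else:
            # No existing cluster is close enough: start a new instance.
            centers.append(vote)
            counts.append(1)
            labels.append(len(centers) - 1)
    return labels


# Two parts along the x-axis; each point's offset points at its part centroid.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0), (11.0, 0.0, 0.0)]
offsets = [(0.5, 0.0, 0.0), (-0.5, 0.0, 0.0), (0.5, 0.0, 0.0), (-0.5, 0.0, 0.0)]
labels = group_by_votes(points, offsets, radius=0.2)
```

Because points of the same part vote for nearly the same location, a small radius suffices to separate instances even when the parts themselves are spatially extended.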

Abstract:
Dense video prediction tasks, such as object tracking and semantic segmentation, require video encoders that generate temporally consistent, spatially dense features for every frame. However, existing approaches fall short: image encoders like DINO or CLIP lack temporal awareness, while video models such as VideoMAE underperform compared to image encoders on dense prediction tasks. We address this gap with FRAME, a self‑supervised video frame encoder tailored for dense video understanding. FRAME learns to predict current and future DINO patch features from past and present RGB frames, leading to spatially precise and temporally coherent representations. To our knowledge, FRAME is the first video encoder to leverage image‑based models for dense prediction while outperforming them on tasks requiring fine‑grained visual correspondence. As an auxiliary capability, FRAME aligns its class token with CLIP's semantic space, supporting language‑driven tasks such as video classification. We evaluate FRAME across six dense prediction tasks on seven datasets, where it consistently outperforms image encoders and existing self‑supervised video models. Despite its versatility, FRAME maintains a compact architecture suitable for a range of downstream applications.

Abstract:
Segmenting Synthetic Aperture Radar (SAR) images is crucial for many remote sensing applications, particularly water body detection. However, deep learning‑based segmentation models often face challenges related to convergence speed and stability, mainly due to the complex statistical distribution of this type of data. In this study, we evaluate the impact of mode normalization on two widely used semantic segmentation models, U‑Net and SegNet. Specifically, we integrate mode normalization to reduce convergence time while maintaining the performance of the baseline models. Experimental results demonstrate that mode normalization significantly accelerates convergence. Furthermore, cross‑validation results indicate that normalized models exhibit increased stability across different zones. These findings highlight the effectiveness of normalization in improving computational efficiency and generalization in SAR image segmentation.

Abstract:
Efficient keyframe extraction is critical for video summarization and retrieval, yet capturing the full semantic and visual richness of video content remains challenging. We introduce TriPSS, a tri‑modal framework that integrates perceptual features from the CIELAB color space, structural embeddings from ResNet‑50, and semantic context from frame‑level captions generated by LLaMA‑3.2‑11B‑Vision‑Instruct. These modalities are fused using principal component analysis to form compact multi‑modal embeddings, enabling adaptive video segmentation via HDBSCAN clustering. A refinement stage incorporating quality assessment and duplicate filtering ensures the final keyframe set is both concise and semantically diverse. Evaluations on the TVSum20 and SumMe benchmarks show that TriPSS achieves state‑of‑the‑art performance, significantly outperforming both unimodal and prior multimodal approaches. These results highlight TriPSS' ability to capture complementary visual and semantic cues, establishing it as an effective solution for video summarization, retrieval, and large‑scale multimedia understanding.

Abstract:
Carbon dioxide (CO2) emissions are critical indicators of both environmental impact and various industrial processes, including livestock management. We introduce CarboFormer, a lightweight semantic segmentation framework for Optical Gas Imaging (OGI), designed to detect and quantify CO2 emissions across diverse applications. Our approach integrates an optimized encoder‑decoder architecture with specialized multi‑scale feature fusion and auxiliary supervision strategies to effectively model both local details and global relationships in gas plume imagery while achieving competitive accuracy with minimal computational overhead for resource‑constrained environments. We contribute two novel datasets: (1) the Controlled Carbon Dioxide Release (CCR) dataset, which simulates gas leaks with systematically varied flow rates (10‑100 SCCM), and (2) the Real Time Ankom (RTA) dataset, focusing on emissions from dairy cow rumen fluid in vitro experiments. Extensive evaluations demonstrate that CarboFormer achieves competitive performance with 84.88% mIoU on CCR and 92.98% mIoU on RTA, while maintaining computational efficiency with only 5.07M parameters and operating at 84.68 FPS. The model shows particular effectiveness in challenging low‑flow scenarios and significantly outperforms other lightweight methods like SegFormer‑B0 (83.36% mIoU on CCR) and SegNeXt (82.55% mIoU on CCR), making it suitable for real‑time monitoring on resource‑constrained platforms such as programmable drones. Our work advances both environmental sensing and precision livestock management by providing robust and efficient tools for CO2 emission analysis.
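
Since the abstract above reports results as mIoU, a minimal sketch of how that metric is commonly computed may help. It assumes flattened integer label lists and skips classes absent from both prediction and ground truth; that is one common convention, not necessarily the one used by any benchmark named here. The function name `mean_iou` and the toy labels are illustrative only.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes, for flattened label lists."""
    ious = []
    for c in range(num_classes):
        # Pixels where both prediction and ground truth are class c
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either prediction or ground truth is class c
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # assumed convention: ignore classes absent from both
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with six pixels and two classes
pred = [0, 0, 1, 1, 1, 0]
target = [0, 1, 1, 1, 0, 0]
score = mean_iou(pred, target, num_classes=2)
```

In this toy case each class has an intersection of 2 pixels and a union of 4 pixels, so both per-class IoUs are 0.5 and the mean is 0.5.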

Abstract:
We present Perceive Anything Model (PAM), a conceptually straightforward and efficient framework for comprehensive region‑level visual understanding in images and videos. Our approach extends the powerful segmentation model SAM 2 by integrating Large Language Models (LLMs), enabling simultaneous object segmentation with the generation of diverse, region‑specific semantic outputs, including categories, label definitions, functional explanations, and detailed captions. A key component, Semantic Perceiver, is introduced to efficiently transform SAM 2's rich visual features, which inherently carry general vision, localization, and semantic priors, into multi‑modal tokens for LLM comprehension. To support robust multi‑granularity understanding, we also develop a dedicated data refinement and augmentation pipeline, yielding a high‑quality dataset of 1.5M image and 0.6M video region‑semantic annotations, including novel region‑level streaming video caption data. PAM is designed to be lightweight and efficient, while also demonstrating strong performance across a diverse range of region understanding tasks. It runs 1.2‑2.4x faster and consumes less GPU memory than prior approaches, offering a practical solution for real‑world applications. We believe that our effective approach will serve as a strong baseline for future research in region‑level visual understanding.

Abstract:
Training neural networks for tasks such as 3D point cloud semantic segmentation demands extensive datasets, yet obtaining and annotating real‑world point clouds is costly and labor‑intensive. This work aims to introduce a novel pipeline for generating realistic synthetic data, by leveraging 3D Gaussian Splatting (3DGS) and Gaussian Opacity Fields (GOF) to generate 3D assets of multiple different agricultural vehicles instead of using generic models. These assets are placed in a simulated environment, where the point clouds are generated using a simulated LiDAR. This is a flexible approach that allows changing the LiDAR specifications without incurring additional costs. We evaluated the impact of synthetic data on segmentation models such as PointNet++, Point Transformer V3, and OACNN, by training and validating the models only on synthetic data. Remarkably, the PTv3 model had an mIoU of 91.35%, a noteworthy result given that the model had neither been trained nor validated on any real data. Further studies even suggested that in certain scenarios the models trained only on synthetically generated data performed better than models trained on real‑world data. Finally, experiments demonstrated that the models can generalize across semantic classes, enabling accurate predictions on mesh models they were never trained on.

Abstract:
Information on trees at the individual level is crucial for monitoring forest ecosystems and planning forest management. Current monitoring methods involve ground measurements, requiring extensive cost, time and labor. Advances in drone remote sensing and computer vision offer great potential for mapping individual trees from aerial imagery at broad‑scale. Large pre‑trained vision models, such as the Segment Anything Model (SAM), represent a particularly compelling choice given limited labeled data. In this work, we compare methods leveraging SAM for the task of automatic tree crown instance segmentation in high resolution drone imagery in three use cases: 1) boreal plantations, 2) temperate forests and 3) tropical forests. We also study the integration of elevation data into models, in the form of Digital Surface Model (DSM) information, which can readily be obtained at no additional cost from RGB drone imagery. We present BalSAM, a model leveraging SAM and DSM information, which shows potential over other methods, particularly in the context of plantations. We find that methods using SAM out‑of‑the‑box do not outperform a custom Mask R‑CNN, even with well‑designed prompts. However, efficiently tuning SAM end‑to‑end and integrating DSM information are both promising avenues for tree crown instance segmentation models.

Abstract:
We introduce CzechLynx, the first large‑scale, open‑access dataset for individual identification, pose estimation, and instance segmentation of the Eurasian lynx (Lynx lynx). CzechLynx contains 39,760 camera trap images annotated with segmentation masks, identity labels, and 20‑point skeletons and covers 319 unique individuals across 15 years of systematic monitoring in two geographically distinct regions: southwest Bohemia and the Western Carpathians. In addition to the real camera trap data, we provide a large complementary set of photorealistic synthetic images and a Unity‑based generation pipeline with diffusion‑based text‑to‑texture modeling, capable of producing arbitrarily large amounts of synthetic data spanning diverse environments, poses, and coat‑pattern variations. To enable systematic testing across realistic ecological scenarios, we define three complementary evaluation protocols: (i) geo‑aware, (ii) time‑aware open‑set, and (iii) time‑aware closed‑set, covering cross‑regional and long‑term monitoring settings. With the provided resources, CzechLynx offers a unique, flexible benchmark for robust evaluation of computer vision and machine learning models across realistic ecological scenarios.

Abstract:
The main goal of representation learning is to acquire meaningful representations from real‑world sensory inputs without supervision. Representation learning explains some aspects of human development. Various neural network (NN) models have been proposed that acquire empirically good representations. However, the formulation of a good representation has not been established. We recently proposed a method for categorizing changes between a pair of sensory inputs. A unique feature of this approach is that transformations between two sensory inputs are learned to satisfy algebraic structural constraints. Conventional representation learning often assumes that disentangled independent feature axes are a good representation; however, we found that such a representation cannot account for conditional independence. To overcome this problem, we proposed a new method using group decomposition in Galois algebra theory. Although this method is promising for defining a more general representation, it assumes pixel‑to‑pixel translation without feature extraction, and can only process low‑resolution images with no background, which prevents real‑world application. In this study, we provide a simple method to apply our group decomposition theory to a more realistic scenario by combining feature extraction and object segmentation. We replace pixel translation with feature translation and formulate object segmentation as grouping features under the same transformation. We validated the proposed method on a practical dataset containing both real‑world objects and backgrounds. We believe that our model will lead to a better understanding of human development of object recognition in the real world.

Abstract:
The title of this paper is perhaps an overclaim. Of course, the process of creating and optimizing a learned model inevitably involves multiple training runs which potentially feature different architectural designs, input and output encodings, and losses. However, our method, You Only Train Once (YOTO), indeed contributes to limiting training to one shot for the latter aspect of loss selection and weighting. We achieve this by automatically optimizing loss weight hyperparameters of learned models in one shot via standard gradient‑based optimization, treating these hyperparameters as regular parameters of the networks and learning them. To this end, we leverage the differentiability of the composite loss formulation, which is widely used for optimizing multiple empirical losses simultaneously, and model it as a novel layer which is parameterized with a softmax operation that satisfies the inherent positivity constraints on loss hyperparameters while avoiding degenerate empirical gradients. We complete our joint end‑to‑end optimization scheme by defining a novel regularization loss on the learned hyperparameters, which models a uniformity prior among the employed losses while ensuring boundedness of the identified optima. We evidence the efficacy of YOTO in jointly optimizing loss hyperparameters and regular model parameters in one shot by comparing it to the commonly used brute‑force grid search across state‑of‑the‑art networks solving two key problems in computer vision, i.e. 3D estimation and semantic segmentation, and showing that it consistently outperforms the best grid‑search model on unseen test data. Code will be made publicly available.
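
The softmax-parameterized composite loss described above can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the form of the regularizer, so the KL divergence to a uniform distribution used here is an assumption, as are the names `composite_loss` and `reg_strength`. In practice the logits would be trainable parameters updated by the same optimizer as the network weights.

```python
import math


def softmax(logits):
    """Numerically stable softmax; outputs are positive and sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def composite_loss(losses, logits, reg_strength=0.1):
    """Weighted sum of losses with softmax weights plus a uniformity penalty."""
    weights = softmax(logits)
    weighted = sum(w * l for w, l in zip(weights, losses))
    n = len(weights)
    # Assumed regularizer: KL(weights || uniform), zero when all weights are equal
    penalty = sum(w * math.log(w * n) for w in weights if w > 0)
    return weighted + reg_strength * penalty, weights


# Three hypothetical loss values with equal (zero) logits
total, weights = composite_loss([0.8, 2.0, 0.3], [0.0, 0.0, 0.0])
```

With equal logits the weights are uniform, the assumed penalty vanishes, and the total reduces to the plain average of the three losses.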

Abstract:
Open‑vocabulary semantic segmentation (OVSS) involves assigning labels to each pixel in an image based on textual descriptions, leveraging world models like CLIP. However, such models encounter significant challenges in cross‑domain generalization, hindering their practical efficacy in real‑world applications. Embodied AI systems are transforming autonomous navigation for ground vehicles and drones by enhancing their perception abilities. In this study, we present AetherVision‑Bench, a benchmark for multi‑angle segmentation across aerial and ground perspectives, which facilitates an extensive evaluation of performance across different viewing angles and sensor modalities. We assess state‑of‑the‑art OVSS models on the proposed benchmark and investigate the key factors that impact the performance of zero‑shot transfer models. Our work pioneers the creation of a robustness benchmark, offering valuable insights and establishing a foundation for future research.

Abstract:
Utilizing multi‑modal data enhances scene understanding by providing complementary semantic and geometric information. Existing methods fuse features or distill knowledge from multiple modalities into a unified representation, improving robustness but restricting each modality's ability to fully leverage its strengths in different situations. We reformulate multi‑modal semantic segmentation as a mask‑level classification task and propose BiXFormer, which integrates Unified Modality Matching (UMM) and Cross Modality Alignment (CMA) to maximize modality effectiveness and handle missing modalities. Specifically, BiXFormer first categorizes multi‑modal inputs into RGB and X, where X represents any non‑RGB modalities, e.g., depth, allowing separate processing for each. This design leverages the well‑established pretraining for RGB, while addressing the relative lack of attention to X modalities. Then, we propose UMM, which includes Modality Agnostic Matching (MAM) and Complementary Matching (CM). MAM assigns labels to features from all modalities without considering modality differences, leveraging each modality's strengths. CM then reassigns unmatched labels to remaining unassigned features within their respective modalities, ensuring that each available modality contributes to the final prediction and mitigating the impact of missing modalities. Moreover, to further facilitate UMM, we introduce CMA, which enhances the weaker queries assigned in CM by aligning them with optimally matched queries from MAM. Experiments on both synthetic and real‑world multi‑modal benchmarks demonstrate the effectiveness of our method, achieving significant improvements in mIoU of +2.75% and +22.74% over the prior arts.

Abstract:
Environmental soundscapes convey substantial ecological and social information regarding urban environments; however, their potential remains largely untapped in large‑scale geographic analysis. In this study, we investigate the extent to which urban sounds correspond with visual scenes by comparing various visual representation strategies in capturing acoustic semantics. We employ a multimodal approach that integrates geo‑referenced sound recordings with both street‑level and remote sensing imagery across three major global cities: London, New York, and Tokyo. Utilizing the AST model for audio, along with CLIP and RemoteCLIP for imagery, as well as CLIPSeg and Seg‑Earth OV for semantic segmentation, we extract embeddings and class‑level features to evaluate cross‑modal similarity. The results indicate that street view embeddings demonstrate stronger alignment with environmental sounds compared to segmentation outputs, whereas remote sensing segmentation is more effective in interpreting ecological categories through a Biophony‑Geophony‑Anthrophony (BGA) framework. These findings imply that embedding‑based models offer superior semantic alignment, while segmentation‑based methods provide interpretable links between visual structure and acoustic ecology. This work advances the burgeoning field of multimodal urban sensing by offering novel perspectives for incorporating sound into geospatial analysis.

Abstract:
Existing LiDAR semantic segmentation models often suffer from decreased accuracy when exposed to adverse weather conditions. Recent methods addressing this issue focus on enhancing training data through weather simulation or universal augmentation techniques. However, few works have studied the negative impacts caused by the heterogeneous domain shifts in the geometric structure and reflectance intensity of point clouds. In this paper, we delve into this challenge and address it with a novel Geometry‑Reflectance Collaboration (GRC) framework that explicitly separates feature extraction for geometry and reflectance. Specifically, GRC employs a dual‑branch architecture designed to independently process geometric and reflectance features initially, thereby capitalizing on their distinct characteristics. Then, GRC adopts a robust multi‑level feature collaboration module to suppress redundant and unreliable information from both branches. Consequently, without complex simulation or augmentation, our method effectively extracts intrinsic information about the scene while suppressing interference, thus achieving better robustness and generalization in adverse weather conditions. We demonstrate the effectiveness of GRC through comprehensive experiments on challenging benchmarks, showing that our method outperforms previous approaches and establishes new state‑of‑the‑art results.

Abstract:
This paper introduces GeoChain, a large‑scale benchmark for evaluating step‑by‑step geographic reasoning in multimodal large language models (MLLMs). Leveraging 1.46 million Mapillary street‑level images, GeoChain pairs each image with a 21‑step chain‑of‑thought (CoT) question sequence (over 30 million Q&A pairs). These sequences guide models from coarse attributes to fine‑grained localization across four reasoning categories ‑ visual, spatial, cultural, and precise geolocation ‑ annotated by difficulty. Images are also enriched with semantic segmentation (150 classes) and a visual locatability score. Our benchmarking of contemporary MLLMs (GPT‑4.1 variants, Claude 3.7, Gemini 2.5 variants) on a diverse 2,088‑image subset reveals consistent challenges: models frequently exhibit weaknesses in visual grounding, display erratic reasoning, and struggle to achieve accurate localization, especially as the reasoning complexity escalates. GeoChain offers a robust diagnostic methodology, critical for fostering significant advancements in complex geographic reasoning within MLLMs.

Abstract:
Since the point cloud data is inherently irregular and unstructured, point cloud semantic segmentation has always been a challenging task. The graph‑based method attempts to model the irregular point cloud by representing it as a graph; however, this approach incurs substantial computational cost due to the necessity of constructing a graph for every point within a large‑scale point cloud. In this paper, we observe that boundary points possess more intricate spatial structural information and develop a novel graph attention network known as the Boundary‑Aware Graph attention Network (BAGNet). On one hand, BAGNet contains a boundary‑aware graph attention layer (BAGLayer), which employs edge vertex fusion and attention coefficients to capture features of boundary points, reducing the computation time. On the other hand, BAGNet employs a lightweight attention pooling layer to extract the global feature of the point cloud to maintain model accuracy. Extensive experiments on standard datasets demonstrate that BAGNet outperforms state‑of‑the‑art methods in point cloud semantic segmentation with higher accuracy and less inference time.

Abstract:
Robot manipulation, especially bimanual manipulation, often requires setting up multiple cameras on multiple robot manipulators. Before robot manipulators can generate motion or even build representations of their environments, the cameras rigidly mounted to the robot need to be calibrated. Camera calibration is a cumbersome process involving collecting a set of images, with each capturing a pre‑determined marker. In this work, we introduce the Bi‑Manual Joint Calibration and Representation Framework (Bi‑JCR). Bi‑JCR enables multiple robot manipulators, each with cameras mounted, to circumvent taking images of calibration markers. By leveraging 3D foundation models for dense, marker‑free multi‑view correspondence, Bi‑JCR jointly estimates: (i) the extrinsic transformation from each camera to its end‑effector, (ii) the inter‑arm relative poses between manipulators, and (iii) a unified, scale‑consistent 3D representation of the shared workspace, all from the same captured RGB image sets. The representation, jointly constructed from images captured by cameras on both manipulators, lives in a common coordinate frame and supports collision checking and semantic segmentation to facilitate downstream bimanual coordination tasks. We empirically evaluate the robustness of Bi‑JCR on a variety of tabletop environments, and demonstrate its applicability on a variety of downstream tasks.

Abstract:
Transformers have been seldom employed in point cloud roof plane instance segmentation, which is the focus of this study, and existing superpoint Transformers suffer from limited performance due to the use of low‑quality superpoints. To address this challenge, we establish two criteria that high‑quality superpoints for Transformers should satisfy and introduce a corresponding two‑stage superpoint generation process. The superpoints generated by our method not only have accurate boundaries, but also exhibit consistent geometric sizes and shapes, both of which greatly benefit the feature learning of superpoint Transformers. To compensate for the limitations of deep learning features when the training set size is limited, we incorporate multidimensional handcrafted features into the model. Additionally, we design a decoder that combines a Kolmogorov‑Arnold Network with a Transformer module to improve instance prediction and mask extraction. Finally, our network's predictions are refined using traditional algorithm‑based postprocessing. For evaluation, we annotated a real‑world dataset and corrected annotation errors in the existing RoofN3D dataset. Experimental results show that our method achieves state‑of‑the‑art performance on our dataset, as well as both the original and reannotated RoofN3D datasets. Moreover, our model is not sensitive to plane boundary annotations during training, significantly reducing the annotation burden. Through comprehensive experiments, we also identified key factors influencing roof plane segmentation performance: in addition to roof types, variations in point cloud density, density uniformity, and 3D point precision have a considerable impact. These findings underscore the importance of incorporating data augmentation strategies that account for point cloud quality to enhance model robustness under diverse and challenging conditions.

Abstract:
Multi‑modal RGB and Depth (RGBD) data are predominant in many domains such as robotics, autonomous driving and remote sensing. The combination of these multi‑modal data enhances environmental perception by providing 3D spatial context, which is absent in standard RGB images. Although RGBD multi‑modal data can be available to train computer vision models, accessing all sensor modalities during the inference stage may be infeasible due to sensor failures or resource constraints, leading to a mismatch between data modalities available during training and inference. Traditional Cross‑Modal Knowledge Distillation (CMKD) frameworks, developed to address this task, are typically based on a teacher/student paradigm, where a multi‑modal teacher distills knowledge into a single‑modality student model. However, these approaches face challenges in teacher architecture choices and distillation process selection, thus limiting their adoption in real‑world scenarios. To overcome these issues, we introduce CroDiNo‑KD (Cross‑Modal Disentanglement: a New Outlook on Knowledge Distillation), a novel cross‑modal knowledge distillation framework for RGBD semantic segmentation. Our approach simultaneously learns single‑modality RGB and Depth models by exploiting disentanglement representation, contrastive learning and decoupled data augmentation with the aim to structure the internal manifolds of neural network models through interaction and collaboration. We evaluated CroDiNo‑KD on three RGBD datasets across diverse domains, considering recent CMKD frameworks as competitors. Our findings illustrate the quality of CroDiNo‑KD, and they suggest reconsidering the conventional teacher/student paradigm to distill information from multi‑modal data to single‑modality neural networks.

Abstract:
Semantic segmentation of crops and weeds is crucial for site‑specific farm management; however, most existing methods depend on labor‑intensive pixel‑level annotations. A further challenge arises when models trained on one field (source domain) fail to generalize to new fields (target domain) due to domain shifts, such as variations in lighting, camera setups, soil composition, and crop growth stages. Unsupervised Domain Adaptation (UDA) addresses this by enabling adaptation without target‑domain labels, but current UDA methods struggle with occlusions and visual blending between crops and weeds, leading to misclassifications in real‑world conditions. To overcome these limitations, we introduce MaskAdapt, a novel approach that enhances segmentation accuracy through multimodal contextual learning by integrating RGB images with features derived from depth data. By computing depth gradients from depth maps, our method captures spatial transitions that help resolve texture ambiguities. These gradients, through a cross‑attention mechanism, refine RGB feature representations, resulting in sharper boundary delineation. In addition, we propose a geometry‑aware masking strategy that applies horizontal, vertical, and stochastic masks during training. This encourages the model to focus on the broader spatial context for robust visual recognition. Evaluations on real agricultural datasets demonstrate that MaskAdapt consistently outperforms existing State‑of‑the‑Art (SOTA) UDA methods, achieving improved segmentation mean Intersection over Union (mIoU) across diverse field conditions.
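
To make the idea of computing depth gradients from a depth map concrete, here is a minimal sketch. The abstract does not name the gradient operator, so the central differences used here (one-sided at the image border) are an assumption, and `depth_gradients` is a hypothetical helper rather than part of MaskAdapt.

```python
def depth_gradients(depth):
    """Gradient magnitude of a 2D depth map given as a list of rows."""
    h, w = len(depth), len(depth[0])
    mag = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Clamp neighbor indices so border pixels use one-sided differences
            x0, x1 = max(x - 1, 0), min(x + 1, w - 1)
            y0, y1 = max(y - 1, 0), min(y + 1, h - 1)
            gx = (depth[y][x1] - depth[y][x0]) / max(x1 - x0, 1)
            gy = (depth[y1][x] - depth[y0][x]) / max(y1 - y0, 1)
            mag[y][x] = (gx * gx + gy * gy) ** 0.5
    return mag


# Toy depth map with a vertical depth step between columns 1 and 2
depth = [
    [1.0, 1.0, 3.0, 3.0],
    [1.0, 1.0, 3.0, 3.0],
    [1.0, 1.0, 3.0, 3.0],
]
grad = depth_gradients(depth)
```

The gradient magnitude is nonzero only in the two columns adjacent to the depth step, which is the kind of spatial transition the abstract says helps separate visually similar regions.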

Abstract:
While massively scaling both data and models has become central in NLP and 2D vision, its benefits for 3D point cloud understanding remain limited. We study the initial step of scaling 3D point cloud understanding under a realistic regime: large‑scale multi‑dataset joint training for 3D semantic segmentation, with no dataset labels available at training or inference time. Point clouds arise from a wide range of sensors (e.g., depth cameras, LiDAR) and scenes (e.g., indoor, outdoor), yielding heterogeneous scanning patterns, sampling densities, and semantic biases; naively mixing such datasets degrades standard models. Therefore, we introduce Point‑MoE, a Mixture‑of‑Experts design that expands model capacity through sparsely activated expert MLPs and a lightweight top‑k router, allowing tokens to select specialized experts without requiring dataset supervision. Trained jointly on a diverse mix of indoor and outdoor datasets, and evaluated on seen datasets as well as in zero‑shot settings, Point‑MoE outperforms prior methods without using dataset labels for either training or inference. This outlines a scalable path for 3D perception: letting the model discover structure in heterogeneous 3D data rather than imposing it via manual curation or dataset‑specific heuristics.
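
The top‑k routing mentioned above can be sketched generically as follows. This illustrates standard Mixture‑of‑Experts gating, not Point‑MoE's actual router: the scalar "token", the toy expert functions, and the choice to renormalize the softmax over the selected experts only are all assumptions made for brevity.

```python
import math


def top_k_route(router_logits, k=2):
    """Return {expert_index: gate_weight} for the k highest-scoring experts."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    # Softmax restricted to the selected experts (assumed normalization)
    m = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_forward(token, router_logits, experts, k=2):
    """Combine the outputs of the selected experts, weighted by their gates."""
    gates = top_k_route(router_logits, k)
    return sum(g * experts[i](token) for i, g in gates.items())


# Four toy experts acting on a scalar token; only two are evaluated per token
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x, lambda x: x * x]
logits = [0.1, 2.0, -1.0, 2.0]
gates = top_k_route(logits, k=2)
out = moe_forward(3.0, logits, experts, k=2)
```

Here experts 1 and 3 tie for the highest score, so each receives a gate of 0.5 and the output is the average of their results. Experts 0 and 2 are never called, which is where the sparsity saving comes from.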

Abstract:
The accurate semantic segmentation of tree crowns within remotely sensed data is crucial for scientific endeavours such as forest management, biodiversity studies, and carbon sequestration quantification. However, precise segmentation remains challenging due to complexities in the forest canopy, including shadows, intricate backgrounds, scale variations, and subtle spectral differences among tree species. Compared to the traditional methods, Deep Learning models improve accuracy by extracting informative and discriminative features, but often fall short in capturing the aforementioned complexities. To address these challenges, we propose PerceptiveNet, a novel model incorporating a Logarithmic Gabor‑parameterised convolutional layer with trainable filter parameters, alongside a backbone that extracts salient features while capturing extensive context and spatial information through a wider receptive field. We investigate the impact of Log‑Gabor, Gabor, and standard convolutional layers on semantic segmentation performance through extensive experimentation. Additionally, we conduct an ablation study to assess the contributions of individual layers and their combinations to overall model performance, and we evaluate PerceptiveNet as a backbone within a novel hybrid CNN‑Transformer model. Our results outperform state‑of‑the‑art models, demonstrating significant performance improvements on a tree crown dataset while generalising across domains, including two benchmark aerial scene semantic segmentation datasets with varying complexities.

Abstract:
Image‑based virtual try‑on aims to fit a target garment to a specific person image and has attracted extensive research attention because of its huge application potential in the e‑commerce and fashion industries. To generate high‑quality try‑on results, accurately warping the clothing item to fit the human body plays a significant role, as slight misalignment may lead to unrealistic artifacts in the fitting image. Most existing methods warp the clothing by feature matching and thin‑plate spline (TPS). However, this approach often fails to preserve clothing details due to self‑occlusion, severe misalignment between poses, etc. To address these challenges, this paper proposes a detail retention virtual try‑on method via accurate non‑rigid registration (VITON‑DRR) for diverse human poses. Specifically, we reconstruct a human semantic segmentation using a dual‑pyramid‑structured feature extractor. Then, a novel Deformation Module is designed for extracting the cloth key points and warping them through an accurate non‑rigid registration algorithm. Finally, the Image Synthesis Module is designed to synthesize the deformed garment image and generate the human pose information adaptively. Compared with traditional methods, the proposed VITON‑DRR can make the deformation of fitting images more accurate and retain more garment details. The experimental results demonstrate that the proposed method performs better than state‑of‑the‑art methods.

Abstract:
In semi‑supervised semantic segmentation (SSSS), data augmentation plays a crucial role in the weak‑to‑strong consistency regularization framework, as it enhances diversity and improves model generalization. Recent strong augmentation methods have primarily focused on intensity‑based perturbations, which have minimal impact on the semantic masks. In contrast, spatial augmentations like translation and rotation have long been acknowledged for their effectiveness in supervised semantic segmentation tasks, but they are often ignored in SSSS. In this work, we demonstrate that spatial augmentation can also contribute to model training in SSSS, despite generating inconsistent masks between the weak and strong augmentations. Furthermore, recognizing the variability among images, we propose an adaptive augmentation strategy that dynamically adjusts the augmentation for each instance based on entropy. Extensive experiments show that our proposed Adaptive Spatial Augmentation (ASAug) can be integrated as a pluggable module, consistently improving the performance of existing methods and achieving state‑of‑the‑art results on benchmark datasets such as PASCAL VOC 2012, Cityscapes, and COCO.
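
As a rough illustration of adjusting augmentation per instance based on entropy, consider the sketch below. The abstract does not say how entropy maps to augmentation strength, so the linear mapping and its direction (more confident predictions receive stronger spatial augmentation) are assumptions, as are the names `mean_entropy`, `augmentation_strength`, `max_shift`, and `max_angle`.

```python
import math


def mean_entropy(prob_map):
    """Average per-pixel Shannon entropy, normalized to [0, 1]."""
    num_classes = len(prob_map[0])
    total = 0.0
    for probs in prob_map:
        # Terms with p == 0 contribute nothing to the entropy
        total += -sum(p * math.log(p) for p in probs if p > 0)
    return total / (len(prob_map) * math.log(num_classes))


def augmentation_strength(prob_map, max_shift=32, max_angle=30.0):
    """Scale translation and rotation limits by prediction confidence."""
    # Assumed mapping: low entropy (confident) -> stronger augmentation
    confidence = 1.0 - mean_entropy(prob_map)
    return {"shift_px": round(max_shift * confidence),
            "angle_deg": max_angle * confidence}


# Each inner list holds the class probabilities for one pixel
confident = augmentation_strength([[1.0, 0.0], [1.0, 0.0]])
uncertain = augmentation_strength([[0.5, 0.5], [0.5, 0.5]])
```

Under this assumed mapping, a fully confident prediction receives the maximum translation and rotation range, while a maximally uncertain one receives no spatial augmentation.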

Abstract:
Lightweight semantic segmentation is essential for many downstream vision tasks. Unfortunately, existing methods often struggle to balance efficiency and performance due to the complexity of feature modeling. Many of these existing approaches are constrained by rigid architectures and implicit representation learning, often characterized by parameter‑heavy designs and a reliance on computationally intensive Vision Transformer‑based frameworks. In this work, we introduce an efficient paradigm by synergizing explicit and implicit modeling to balance computational efficiency with representational fidelity. Our method combines well‑defined Cartesian directions with explicitly modeled views and implicitly inferred intermediate representations, efficiently capturing global dependencies through a nested attention mechanism. Extensive experiments on challenging datasets, including ADE20K, CityScapes, Pascal Context, and COCO‑Stuff, demonstrate that LeMoRe strikes an effective balance between performance and efficiency.

Abstract:
In this study, we present a novel LiDAR‑based semantic segmentation framework tailored for autonomous forklifts operating in complex outdoor environments. Central to our approach is the integration of a dual LiDAR system, which combines forward‑facing and downward‑angled LiDAR sensors to enable comprehensive scene understanding, specifically tailored for industrial material handling tasks. The dual configuration improves the detection and segmentation of dynamic and static obstacles with high spatial precision. Using high‑resolution 3D point clouds captured from two sensors, our method employs a lightweight yet robust approach that segments the point clouds into safety‑critical instance classes such as pedestrians, vehicles, and forklifts, as well as environmental classes such as driveable ground, lanes, and buildings. Experimental validation demonstrates that our approach achieves high segmentation accuracy while satisfying strict runtime requirements, establishing its viability for safety‑aware, fully autonomous forklift navigation in dynamic warehouse and yard environments.

Abstract:
Coral reefs, crucial for sustaining marine biodiversity and ecological processes (e.g., nutrient cycling, habitat provision), face escalating threats, underscoring the need for efficient monitoring. Coral reef ecological monitoring faces dual challenges of low efficiency in manual analysis and insufficient segmentation accuracy in complex underwater scenarios. This study develops the YH‑MINER system, establishing an intelligent framework centered on the Multimodal Large Model (MLLM) for "object detection‑semantic segmentation‑prior input". The system uses the object detection module (mAP@0.5=0.78) to generate spatial prior boxes for coral instances, driving the segment module to complete pixel‑level segmentation in low‑light and densely occluded scenarios. The segmentation masks and finetuned classification instructions are fed into the Qwen2‑VL‑based multimodal model as prior inputs, achieving a genus‑level classification accuracy of 88% and simultaneously extracting core ecological metrics. Meanwhile, the system retains the scalability of the multimodal model through standardized interfaces, laying a foundation for future integration into multimodal agent‑based underwater robots and supporting the full‑process automation of "image acquisition‑prior generation‑real‑time analysis".

Abstract:
Weakly supervised semantic segmentation (WSSS) in medical imaging struggles with effectively using sparse annotations. One promising direction for WSSS leverages gaze annotations, captured via eye trackers that record regions of interest during diagnostic procedures. However, existing gaze‑based methods, such as GazeMedSeg, do not fully exploit the rich information embedded in gaze data. In this paper, we propose GradTrack, a framework that utilizes physicians' gaze track, including fixation points, durations, and temporal order, to enhance WSSS performance. GradTrack comprises two key components: Gaze Track Map Generation and Track Attention, which collaboratively enable progressive feature refinement through multi‑level gaze supervision during the decoding process. Experiments on the Kvasir‑SEG and NCI‑ISBI datasets demonstrate that GradTrack consistently outperforms existing gaze‑based methods, achieving Dice score improvements of 3.21% and 2.61%, respectively. Moreover, GradTrack significantly narrows the performance gap with fully supervised models such as nnUNet.

Abstract:
Semantic segmentation is one of the most fundamental tasks in image understanding, with a long history of research and, consequently, a myriad of different approaches. Traditional methods strive to train models from scratch, requiring vast amounts of computational resources and training data. With the move to open‑vocabulary semantic segmentation, which asks models to classify beyond learned categories, large quantities of finely annotated data would be prohibitively expensive. Researchers have instead turned to training‑free methods that leverage existing models made for tasks where data is more easily acquired. Specifically, this survey covers the history, nuance, idea development and the state‑of‑the‑art in training‑free open‑vocabulary semantic segmentation that leverages existing multi‑modal classification models. We first give a preliminary on the task definition, followed by an overview of popular model archetypes, and then spotlight over 30 approaches split into broader research branches: purely CLIP‑based, those leveraging auxiliary visual foundation models and ones relying on generative methods. Subsequently, we discuss the limitations and potential problems of current research, and provide some underexplored ideas for future study. We believe this survey will serve as a good onboarding read for new researchers and spark increased interest in the area.

Abstract:
Vision Transformer (ViT) has made significant advancements in computer vision, thanks to its token mixer's sophisticated ability to capture global dependencies between all tokens. However, the quadratic growth in computational demands as the number of tokens increases limits its practical efficiency. Although recent methods have combined the strengths of convolutions and self‑attention to achieve better trade‑offs, the expensive pairwise token affinity and complex matrix operations inherent in self‑attention remain a bottleneck. To address this challenge, we propose S2AFormer, an efficient Vision Transformer architecture featuring novel Strip Self‑Attention (SSA). We design simple yet effective Hybrid Perception Blocks (HPBs) to effectively integrate the local perception capabilities of CNNs with the global context modeling of Transformer's attention mechanisms. A key innovation of SSA lies in its reduction of the spatial dimensions of K and V, while compressing the channel dimensions of Q and K. This design significantly reduces computational overhead while preserving accuracy, striking an optimal balance between efficiency and effectiveness. We evaluate the robustness and efficiency of S2AFormer through extensive experiments on multiple vision benchmarks, including ImageNet‑1k for image classification, ADE20k for semantic segmentation, and COCO for object detection and instance segmentation. Results demonstrate that S2AFormer achieves significant accuracy gains with superior efficiency in both GPU and non‑GPU environments, making it a strong candidate for efficient vision Transformers.

Abstract:
Instance segmentation demands costly per‑pixel annotations and computationally expensive models. We introduce CAST, a semi‑supervised knowledge distillation (SSKD) framework that compresses pre‑trained vision foundation models (VFM) into compact experts using limited labeled and abundant unlabeled data. CAST unfolds in three stages: (1) domain adaptation of the VFM(s) via self‑training with contrastive calibration, (2) knowledge transfer through a unified multi‑objective loss, and (3) student refinement to mitigate residual pseudo‑label bias. Central to CAST is an instance‑aware pixel‑wise contrastive loss that fuses mask and class scores to extract informative negatives and enforce clear inter‑instance margins. By maintaining this contrastive signal across both adaptation and distillation, we align teacher and student embeddings and fully leverage unlabeled images. On Cityscapes and ADE20K, our ~11x smaller student improves over its zero‑shot VFM teacher(s) by +8.5 and +7.1 AP, surpasses adapted teacher(s) by +3.4 and +1.5 AP, and further outperforms state‑of‑the‑art SSKD methods on both benchmarks.

Abstract:
Learning visual representations from observing actions to benefit robot visuo‑motor policy generation is a promising direction that closely resembles human cognitive function and perception. Motivated by this, and further inspired by psychological theories suggesting that humans process scenes in an object‑based fashion, we propose an object‑centric encoder that performs semantic segmentation and visual representation generation in a coupled manner, unlike other works, which treat these as separate processes. To achieve this, we leverage the Slot Attention mechanism and use the SOLV model, pretrained on large out‑of‑domain datasets, to bootstrap fine‑tuning on human action video data. Through simulated robotic tasks, we demonstrate that visual representations can enhance reinforcement and imitation learning training, highlighting the effectiveness of our integrated approach for semantic segmentation and encoding. Furthermore, we show that exploiting models pretrained on out‑of‑domain datasets can benefit this process, and that fine‑tuning on datasets depicting human actions, although still out‑of‑domain, can significantly improve performance due to close alignment with robotic tasks. These findings show the capability to reduce reliance on annotated or robot‑specific action datasets and the potential to build on existing visual encoders to accelerate training and improve generalizability.

Abstract:
Camera‑based 3D semantic occupancy prediction offers an efficient and cost‑effective solution for perceiving surrounding scenes in autonomous driving. However, existing works rely on explicit occupancy state inference, leading to numerous incorrect feature assignments, and insufficient samples restrict the learning of occupancy class inference. To address these challenges, we propose leveraging Depth awareness and Semantic aid to boost camera‑based 3D semantic Occupancy prediction (DSOcc). We jointly perform occupancy state and occupancy class inference, where soft occupancy confidence is calculated by a non‑learning method and multiplied with image features to make voxels aware of depth, enabling adaptive implicit occupancy state inference. Instead of enhancing feature learning, we directly utilize well‑trained image semantic segmentation and fuse multiple frames with their occupancy probabilities to aid occupancy class inference, thereby enhancing robustness. Experimental results demonstrate that DSOcc achieves state‑of‑the‑art performance on the SemanticKITTI dataset among camera‑based methods and achieves competitive performance on the SSCBench‑KITTI‑360 and Occ3D‑nuScenes datasets. Code will be released on GitHub.

Abstract:
Landing safely in crowded urban environments remains an essential yet challenging endeavor for Unmanned Aerial Vehicles (UAVs), especially in emergency situations. In this work, we propose a risk‑aware approach that harnesses semantic segmentation to continuously evaluate potential hazards in the drone's field of view. By using a specialized deep neural network to assign pixel‑level risk values and applying an algorithm based on risk maps, our method adaptively identifies a stable Safe Landing Zone (SLZ) despite moving critical obstacles such as vehicles, people, etc., and other visual challenges like shifting illumination. A control system then guides the UAV toward this low‑risk region, employing altitude‑dependent safety thresholds and temporal landing point stabilization to ensure robust descent trajectories. Experimental validation in diverse urban environments demonstrates the effectiveness of our approach, achieving over 90% landing success rates in very challenging real scenarios, showing significant improvements in various risk metrics. Our findings suggest that risk‑oriented vision methods can effectively help reduce the risk of accidents in emergency landing situations, particularly in complex, unstructured, urban scenarios, densely populated with moving risky obstacles, while potentiating the true capabilities of UAVs in complex urban operations.

Abstract:
Accurate parameterization of rooftop photovoltaic (PV) installations is critical for effective grid management and strategic large‑scale solar deployment. The lack of high‑fidelity datasets for PV configuration parameters often compels practitioners to rely on coarse assumptions, undermining both the temporal and numerical accuracy of large‑scale PV performance modeling. This study introduces a fully automated framework that integrates remote sensing data, semantic segmentation, polygon‑vector refinement, tilt‑azimuth estimation, and module layout inference to produce a richly attributed GIS dataset of distributed PV. Applied to Eindhoven (the Netherlands), the method achieves a correlation (R²) of 0.92 with Distribution System Operator (DSO) records, while capacity estimates for 73% of neighborhoods agree with recorded data within a ±25% margin. Additionally, by accurately capturing actual system configuration parameters (e.g., tilt, azimuth, module layout) and seamlessly linking them to advanced performance models, the method yields more reliable PV energy generation forecasts within the distribution networks. Focusing our experiments on a high PV‑penetration community, configuration‑aware simulations help to reduce the Mean Absolute Percentage Error (MAPE) of energy generation modeling by up to 160% compared to the conventional assumption‑based approaches. Furthermore, owing to its modular design and reliance on readily available geospatial resources, the workflow can be extended across diverse regions, offering a scalable solution for robust urban solar integration.
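
For readers unfamiliar with the two agreement measures quoted above, the following minimal sketch shows how MAPE and the share of estimates falling within a ±25% margin are conventionally computed. The function names and the three capacity values are hypothetical and are not taken from the study.

```python
def mape(recorded, estimated):
    """Mean Absolute Percentage Error, in percent."""
    errors = [abs(e - r) / abs(r) for r, e in zip(recorded, estimated)]
    return 100.0 * sum(errors) / len(errors)


def share_within_margin(recorded, estimated, margin=0.25):
    """Fraction of estimates whose relative deviation is at most `margin`."""
    hits = sum(
        1 for r, e in zip(recorded, estimated) if abs(e - r) <= margin * abs(r)
    )
    return hits / len(recorded)


# Hypothetical per-neighborhood PV capacities (kWp), for illustration only.
recorded = [100.0, 200.0, 400.0]
estimated = [110.0, 150.0, 520.0]

print(round(mape(recorded, estimated), 2))
print(round(share_within_margin(recorded, estimated), 2))
```

In this toy data the relative deviations are 10%, 25%, and 30%, so two of the three estimates fall inside the margin.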

Abstract:
Satellite Image Time Series (SITS) is crucial for agricultural semantic segmentation. However, cloud contamination introduces time gaps in SITS, disrupting temporal dependencies and causing feature shifts, leading to degraded performance of models trained on complete SITS. Existing methods typically address this by reconstructing the entire SITS before prediction or using data augmentation to simulate missing data. Yet, full reconstruction may introduce noise and redundancy, while the data‑augmented model can only handle limited missing patterns, leading to poor generalization. We propose a joint learning framework with feature reconstruction and prediction to address incomplete SITS more effectively. During training, we simulate data‑missing scenarios using temporal masks. The two tasks are guided by both ground‑truth labels and the teacher model trained on complete SITS. The prediction task constrains the model from selectively reconstructing critical features from masked inputs that align with the teacher's temporal feature representations. It reduces unnecessary reconstruction and limits noise propagation. By integrating reconstructed features into the prediction task, the model avoids learning shortcuts and maintains its ability to handle varied missing patterns and complete SITS. Experiments on SITS from Hunan Province, Western France, and Catalonia show that our method improves mean F1‑scores by 6.93% in cropland extraction and 7.09% in crop classification over baselines. It also generalizes well across satellite sensors, including Sentinel‑2 and PlanetScope, under varying temporal missing rates and model backbones.

Abstract:
Accurate tumour segmentation is vital for various targeted diagnostic and therapeutic procedures for cancer, e.g., planning biopsies or tumour ablations. Manual delineation is extremely labour‑intensive, requiring substantial expert time. Fully‑supervised machine learning models aim to automate such localisation tasks, but require a large number of costly and often subjective 3D voxel‑level labels for training. The high variance and subjectivity in such labels impact model generalisability, even when large datasets are available. Histopathology labels may offer more objective labels, but acquiring the pixel‑level annotations needed to develop histology‑based tumour localisation methods remains infeasible in vivo. In this work, we propose a novel weakly‑supervised semantic segmentation framework called SPARS (Self‑Play Adversarial Reinforcement Learning for Segmentation), which utilises an object presence classifier, trained on a small number of image‑level binary cancer presence labels, to localise cancerous regions on CT scans. Such binary labels of patient‑level cancer presence can be sourced more feasibly from biopsies and histopathology reports, enabling a more objective cancer localisation on medical images. Evaluating with real patient data, we observed that SPARS yielded a mean Dice score of 77.3 ± 9.4, which outperformed other weakly‑supervised methods by large margins. This performance was comparable with recent fully‑supervised methods that require voxel‑level annotations. Our results demonstrate the potential of using SPARS to reduce the need for extensive human‑annotated labels to detect cancer in real‑world healthcare settings.

Abstract:
We present a novel active learning framework for 3D point cloud semantic segmentation that, for the first time, integrates large language models (LLMs) to construct hierarchical label structures and guide uncertainty‑based sample selection. Unlike prior methods that treat labels as flat and independent, our approach leverages LLM prompting to automatically generate multi‑level semantic taxonomies and introduces a recursive uncertainty projection mechanism that propagates uncertainty across hierarchy levels. This enables spatially diverse, label‑aware point selection that respects the inherent semantic structure of 3D scenes. Experiments on S3DIS and ScanNet v2 show that our method achieves up to 4% mIoU improvement under extremely low annotation budgets (e.g., 0.02%), substantially outperforming existing baselines. Our results highlight the untapped potential of LLMs as knowledge priors in 3D vision and establish hierarchical uncertainty modeling as a powerful paradigm for efficient point cloud annotation.

Abstract:
In real‑world scenarios, pixel‑level labeling is not always available. Sometimes we need a semantic segmentation network, or even a visual encoder, that is highly compatible with, and can be trained using, various types of feedback beyond traditional labels, such as feedback that indicates the quality of the parsing results. To tackle this issue, we propose RSS (Reward in Semantic Segmentation), the first practical application of reward‑based reinforcement learning to pure semantic segmentation, offered at two levels of granularity (pixel‑level and image‑level). RSS incorporates various novel technologies, such as progressive scale rewards (PSR) and pair‑wise spatial difference (PSD), to ensure that the reward facilitates the convergence of the semantic segmentation network, especially under image‑level rewards. Experiments and visualizations on benchmark datasets demonstrate that the proposed RSS can successfully ensure the convergence of the semantic segmentation network at both levels of reward. Additionally, RSS with an image‑level reward outperforms existing weakly supervised methods that also rely solely on image‑level signals during training.

Abstract:
High‑resolution remote sensing (HRRS) image segmentation is challenging due to complex spatial layouts and diverse object appearances. While CNNs excel at capturing local features, they struggle with long‑range dependencies, whereas Transformers can model global context but often neglect local details and are computationally expensive. We propose a novel approach, Region‑Aware Proxy Network (RAPNet), which consists of two components: Contextual Region Attention (CRA) and Global Class Refinement (GCR). Unlike traditional methods that rely on grid‑based layouts, RAPNet operates at the region level for more flexible segmentation. The CRA module uses a Transformer to capture region‑level contextual dependencies, generating a Semantic Region Mask (SRM). The GCR module learns a global class attention map to refine multi‑class information, combining the SRM and attention map for accurate segmentation. Experiments on three public datasets show that RAPNet outperforms state‑of‑the‑art methods, achieving superior multi‑class segmentation accuracy.

Abstract:
Interpreting the mineralogical aspects of rock thin sections is an important task for oil and gas reservoir evaluation. However, human analysis tends to be subjective and laborious. Technologies like QEMSCAN(R) are designed to automate the mineralogical mapping process, but also suffer from limitations like high monetary costs and time‑consuming analysis. This work proposes a Convolutional Neural Network model for automatic mineralogical segmentation of thin section images of carbonate rocks. The model is able to mimic the QEMSCAN mapping itself in a low‑cost, generalized and efficient manner. For this, the U‑Net semantic segmentation architecture is trained on plane and cross polarized thin section images using the corresponding QEMSCAN maps as target, which is an approach not widely explored. The model was trained to differentiate occurrences of Calcite, Dolomite, Mg‑Clay Minerals, Quartz, Pores and the remaining mineral phases as a single class named "Others", and it was validated on rock facies both seen and unseen during training, in order to assess its generalization capability. Since the images and maps are provided in different resolutions, image registration was applied to align them spatially. The study reveals that the quality of the segmentation is very much dependent on these resolution differences and on the variety of learnable rock textures. However, it shows promising results, especially with regard to the proper delineation of mineral boundaries on solid textures and precise estimation of the mineral distributions, describing a nearly linear relationship between expected and predicted distributions, with a coefficient of determination (R²) above 0.97 for seen facies and 0.88 for unseen.

Abstract:
Segment Anything Models (SAM) have achieved remarkable success in object segmentation tasks across diverse datasets. However, these models are predominantly trained on large‑scale semantic segmentation datasets, which introduce a bias toward object shape rather than texture cues in the image. This limitation is critical in domains such as medical imaging, material classification, and remote sensing, where texture changes define object boundaries. In this study, we investigate SAM's bias toward semantics over textures and introduce a new texture‑aware foundation model, TextureSAM, which performs superior segmentation in texture‑dominant scenarios. To achieve this, we employ a novel fine‑tuning approach that incorporates texture augmentation techniques, incrementally modifying training images to emphasize texture features. By leveraging a novel texture‑alternation of the ADE20K dataset, we guide TextureSAM to prioritize texture‑defined regions, thereby mitigating the inherent shape bias present in the original SAM model. Our extensive experiments demonstrate that TextureSAM significantly outperforms SAM‑2 on both natural (+0.2 mIoU) and synthetic (+0.18 mIoU) texture‑based segmentation datasets. The code and texture‑augmented dataset will be publicly available.

Abstract:
According to the EPA, only 25% of waste is recycled, and just 60% of U.S. municipalities offer curbside recycling. Plastics fare worse, with a recycling rate of only 8%; an additional 16% is incinerated, while the remaining 76% ends up in landfills. The low plastic recycling rate stems from contamination, poor economic incentives, and technical difficulties, making efficient recycling a challenge. To improve recovery, automated sorting plays a critical role. Companies like AMP Robotics and Greyparrot utilize optical systems for sorting, while Materials Recovery Facilities (MRFs) employ Near‑Infrared (NIR) sensors to detect plastic types. Modern optical sorting uses advances in computer vision such as object recognition and instance segmentation, powered by machine learning. Two‑stage detectors like Mask R‑CNN use region proposals and classification with deep backbones like ResNet. Single‑stage detectors like YOLO handle detection in one pass, trading some accuracy for speed. While such methods excel under ideal conditions with a large volume of labeled training data, challenges arise in realistic scenarios, emphasizing the need to further examine the efficacy of optic detection for automated sorting. In this study, we compiled novel datasets totaling 20,000+ images from varied sources. Using both public and custom machine learning pipelines, we assessed the capabilities and limitations of optical recognition for sorting. Grad‑CAM, saliency maps, and confusion matrices were employed to interpret model behavior. We perform this analysis on our custom trained models from the compiled datasets. To conclude, our findings are that optic recognition methods have limited success in accurate sorting of real‑world plastics at MRFs, primarily because they rely on physical properties such as color and shape.

Abstract:
While most people associate LiDAR primarily with its ability to measure distances and provide geometric information about the environment (via point clouds), LiDAR also captures additional data, including reflectivity or intensity values. Unfortunately, when LiDAR is applied to Place Recognition (PR) in mobile robotics, most previous works on LiDAR‑based PR rely only on geometric measurements, neglecting the additional reflectivity information that LiDAR provides. In this paper, we propose a novel descriptor for 3D PR, named RE‑TRIP (REflectivity‑instance augmented TRIangle descriPtor). This new descriptor leverages both geometric measurements and reflectivity to enhance robustness in challenging scenarios such as geometric degeneracy, high geometric similarity, and the presence of dynamic objects. To implement RE‑TRIP in real‑world applications, we further propose (1) a keypoint extraction method, (2) a key instance segmentation method, (3) a RE‑TRIP matching method, and (4) a reflectivity‑combined loop verification method. Finally, we conduct a series of experiments to demonstrate the effectiveness of RE‑TRIP. Applied to public datasets (i.e., HELIPR, FusionPortable) containing diverse scenarios such as long corridors, bridges, large‑scale urban areas, and highly dynamic environments, the proposed method outperforms existing state‑of‑the‑art methods such as Scan Context, Intensity Scan Context, and STD.

Abstract:
Large‑scale pretrained vision backbones have transformed computer vision by providing powerful feature extractors that enable various downstream tasks, including training‑free approaches like visual prompting for semantic segmentation. Despite their success in generic scenarios, these models often fall short when applied to specialized technical domains where the visual features differ significantly from their training distribution. To bridge this gap, we introduce VP Lab, a comprehensive iterative framework that enhances visual prompting for robust segmentation model development. At the core of VP Lab lies E‑PEFT, a novel ensemble of parameter‑efficient fine‑tuning techniques specifically designed to adapt our visual prompting pipeline to specific domains in a manner that is both parameter‑ and data‑efficient. Our approach not only surpasses the state‑of‑the‑art in parameter‑efficient fine‑tuning for the Segment Anything Model (SAM), but also facilitates an interactive, near‑real‑time loop, allowing users to observe progressively improving results as they experiment within the framework. By integrating E‑PEFT with visual prompting, we demonstrate a remarkable 50% increase in semantic segmentation mIoU performance across various technical datasets using only 5 validated images, establishing a new paradigm for fast, efficient, and interactive model deployment in new, challenging domains. This work comes in the form of a demonstration.

Abstract:
3D semantic segmentation plays a pivotal role in autonomous driving and road infrastructure analysis, yet state‑of‑the‑art 3D models are prone to severe domain shift when deployed across different datasets. In this paper, we propose an Unsupervised Domain Adaptation approach where a 3D segmentation model is trained on the target dataset using pseudo‑labels generated by a novel multi‑view projection framework. Our approach first aligns LiDAR scans into coherent 3D scenes and renders them from multiple virtual camera poses to create large‑scale synthetic 2D semantic segmentation datasets in various modalities. The generated datasets are used to train an ensemble of 2D segmentation models in the point cloud view domain on each modality. During inference, the models process a large number of views per scene; the resulting logits are back‑projected to 3D with a depth‑aware voting scheme to generate final point‑wise labels. These labels are then used to fine‑tune a 3D segmentation model in the target domain. We evaluate our approach Real‑to‑Real on the nuScenes and SemanticKITTI datasets. We also evaluate it Simulation‑to‑Real with the SynLidar dataset. Our contributions are a novel method that achieves state‑of‑the‑art results in Real‑to‑Real Unsupervised Domain Adaptation, and an application of our method to segment rare classes for which target 3D annotations are not available, using only 2D annotations for those classes and leveraging 3D annotations for other classes in a source domain.

Abstract:
Semantic segmentation relying solely on RGB data often struggles in challenging conditions such as low illumination and obscured views, limiting its reliability in critical applications like autonomous driving. To address this, integrating additional thermal radiation data with RGB images demonstrates enhanced performance and robustness. However, how to effectively reconcile the modality discrepancies and fuse the RGB and thermal features remains a well‑known challenge. In this work, we address this challenge from a novel spectral perspective. We observe that the multi‑modal features can be categorized into two spectral components: low‑frequency features that provide broad scene context, including color variations and smooth areas, and high‑frequency features that capture modality‑specific details such as edges and textures. Inspired by this, we propose the Spectral‑aware Global Fusion Network (SGFNet) to effectively enhance and fuse the multi‑modal features by explicitly modeling the interactions between the high‑frequency, modality‑specific features. Our experimental results demonstrate that SGFNet outperforms the state‑of‑the‑art methods on the MFNet and PST900 datasets.

Abstract:
Autonomous robots must reason about the physical consequences of their actions to operate effectively in unstructured, real‑world environments. We present Scan, Materialize, Simulate (SMS), a unified framework that combines 3D Gaussian Splatting for accurate scene reconstruction, visual foundation models for semantic segmentation, vision‑language models for material property inference, and physics simulation for reliable prediction of action outcomes. By integrating these components, SMS enables generalizable physical reasoning and object‑centric planning without the need to re‑learn foundational physical dynamics. We empirically validate SMS in a billiards‑inspired manipulation task and a challenging quadrotor landing scenario, demonstrating robust performance on both simulated domain transfer and real‑world experiments. Our results highlight the potential of bridging differentiable rendering for scene reconstruction, foundation models for semantic understanding, and physics‑based simulation to achieve physically grounded robot planning across diverse settings.

Abstract:
Three‑dimensional reconstruction of buildings, particularly at Level of Detail 1 (LOD1), plays a crucial role in various applications such as urban planning, urban environmental studies, and designing optimized transportation networks. This study focuses on assessing the potential of LiDAR data for accurate 3D building reconstruction at LOD1 and extracting morphological features from these models. Four deep semantic segmentation models, U‑Net, Attention U‑Net, U‑Net3+, and DeepLabV3+, were used, applying transfer learning to extract building footprints from LiDAR data. The results showed that U‑Net3+ and Attention U‑Net outperformed the others, achieving IoU scores of 0.833 and 0.814, respectively. Various statistical measures, including maximum, range, mode, median, and the 90th percentile, were used to estimate building heights, resulting in the generation of 3D models at LOD1. As the main contribution of the research, the impact of segmentation accuracy on the quality of 3D building modeling and the accuracy of morphological features like building area and external wall surface area was investigated. The results showed that the accuracy of building identification (segmentation performance) significantly affects the 3D model quality and the estimation of morphological features, depending on the height calculation method. Overall, the U‑Net3+ method, utilizing the 90th percentile and median measures, leads to accurate height estimation of buildings and the extraction of morphological features.
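
The height statistics and morphological features named above can be illustrated with a short sketch. It assumes that each footprint comes with a list of above‑ground LiDAR heights and a simple polygon outline; the sample values, function names, and the linear‑interpolation percentile are illustrative choices, not details from the study. For an LOD1 block model (a footprint extruded to a single height), external wall area is perimeter times height.

```python
import math
import statistics


def percentile(values, q):
    """Linearly interpolated q-th percentile (q in [0, 100])."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    low = math.floor(rank)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (rank - low) * (ordered[high] - ordered[low])


def height_estimates(heights):
    """Candidate single-value building heights from per-point heights."""
    return {
        "maximum": max(heights),
        "range": max(heights) - min(heights),
        "mode": statistics.mode([round(h, 1) for h in heights]),
        "median": statistics.median(heights),
        "p90": percentile(heights, 90),
    }


def footprint_area_and_perimeter(vertices):
    """Shoelace area and perimeter of a simple polygon given as (x, y) pairs."""
    area2 = 0.0
    perimeter = 0.0
    for (x1, y1), (x2, y2) in zip(vertices, vertices[1:] + vertices[:1]):
        area2 += x1 * y2 - x2 * y1
        perimeter += math.hypot(x2 - x1, y2 - y1)
    return abs(area2) / 2.0, perimeter


# Hypothetical heights (m) inside one footprint; 14.0 mimics a rooftop outlier.
heights = [9.8, 10.0, 10.1, 10.2, 10.2, 10.3, 10.4, 10.5, 10.6, 14.0]
footprint = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]

estimates = height_estimates(heights)
area, perimeter = footprint_area_and_perimeter(footprint)
wall_area = perimeter * estimates["median"]  # LOD1: vertical walls only
print(estimates, area, wall_area)
```

The outlier shows why the choice of statistic matters: the maximum follows the single 14.0 m return, while the median stays near the bulk of the roof points.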

Abstract:
Semantic segmentation stands as a pivotal research focus in computer vision. In the context of industrial image inspection, conventional semantic segmentation models fail to maintain the segmentation consistency of fixed components across varying contextual environments due to a lack of perception of object contours. Given the real‑time constraints and limited computing capability of industrial image detection machines, it is also necessary to create efficient models to reduce computational complexity. In this work, a Shape‑Aware Efficient Network (SPENet) is proposed, which focuses on the shapes of objects to achieve excellent segmentation consistency by separately supervising the extraction of boundary and body information from images. In SPENet, a novel method named Variable Boundary Domain (VBD) is introduced for describing fuzzy boundaries, to better adapt to real‑world scenarios. Additionally, a new metric, Consistency Mean Square Error (CMSE), is proposed to measure segmentation consistency for fixed components. Our approach attains the best segmentation accuracy and competitive speed on our dataset and shows significant advantages in CMSE among numerous state‑of‑the‑art real‑time segmentation networks, reducing it by over 50% compared to the previously top‑performing models.

Abstract:
Recently proposed neural network architectures like PointNet [QSMG16] and PointNet++ [QYSG17] have made it possible to apply Deep Learning to 3D point sets. The feature representations of shapes learned by these two networks enabled training classifiers for Semantic Segmentation, and more recently for Instance Segmentation via the Similarity Group Proposal Network (SGPN) [WYHN17]. One area of improvement highlighted by SGPN's authors pertains to the use of memory‑intensive similarity matrices, which occupy memory quadratic in the number of points. In this report, we attempt to tackle this issue through the use of two sampling‑based methods, which compute Instance Segmentation on a sub‑sampled Point Set and then extrapolate labels to the complete set using the nearest‑neighbour approach. While both approaches perform equally well on large sub‑samples, the random‑based strategy gives the most improvements in terms of speed and memory usage.
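
The sub-sample-then-extrapolate idea can be sketched as follows. This is a minimal illustration under stated assumptions: the instance labels on the sub-sample are supplied by hand here (in the report they would come from the segmentation network), the helper names `random_subsample` and `extrapolate_labels` are hypothetical, and the brute-force nearest-neighbour search stands in for whatever spatial index a real implementation would use.

```python
import random


def random_subsample(n_points, n_samples, seed=0):
    """Pick n_samples distinct point indices uniformly at random."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_points), n_samples))


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def extrapolate_labels(points, sample_idx, sample_labels):
    """Give every point the instance label of its nearest sub-sampled point."""
    labels = []
    for p in points:
        nearest = min(
            range(len(sample_idx)),
            key=lambda k: squared_distance(p, points[sample_idx[k]]),
        )
        labels.append(sample_labels[nearest])
    return labels


# Two well-separated clusters; only points 0 and 3 were segmented.
points = [(0, 0, 0), (0.1, 0, 0), (0, 0.1, 0), (5, 5, 5), (5.1, 5, 5), (5, 5.1, 5)]
print(extrapolate_labels(points, sample_idx=[0, 3], sample_labels=[0, 1]))
```

With m sampled points out of N, a pairwise similarity matrix needs m² entries instead of N², which is where the memory saving described above comes from.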

Abstract:
Few‑Shot Segmentation (FSS) aims to learn class‑agnostic segmentation on few classes to segment arbitrary classes, but at the risk of overfitting. To address this, some methods use the well‑learned knowledge of foundation models (e.g., SAM) to simplify the learning process. Recently, SAM 2 has extended SAM by supporting video segmentation, whose class‑agnostic matching ability is useful to FSS. A simple idea is to encode support foreground (FG) features as memory, with which query FG features are matched and fused. Unfortunately, the FG objects in different frames of SAM 2's video data are always the same identity, while those in FSS are different identities, i.e., the matching step is incompatible. Therefore, we design Pseudo Prompt Generator to encode pseudo query memory, matching with query features in a compatible way. However, the memories can never be as accurate as the real ones, i.e., they are likely to contain incomplete query FG, and some unexpected query background (BG) features, leading to wrong segmentation. Hence, we further design Iterative Memory Refinement to fuse more query FG features into the memory, and devise a Support‑Calibrated Memory Attention to suppress the unexpected query BG features in memory. Extensive experiments have been conducted on PASCAL‑5^i and COCO‑20^i to validate the effectiveness of our design, e.g., the 1‑shot mIoU can be 4.2% better than the best baseline.

Abstract:
We introduce Land‑MoE, a novel approach for multispectral land cover classification (MLCC). Spectral shift, which emerges from disparities in sensors and geospatial conditions, poses a significant challenge in this domain. Existing methods predominantly rely on domain adaptation and generalization strategies, often utilizing small‑scale models that exhibit limited performance. In contrast, Land‑MoE addresses these issues by hierarchically inserting a Frequency‑aware Mixture of Low‑rank Token Experts, to fine‑tune Vision Foundation Models (VFMs) in a parameter‑efficient manner. Specifically, Land‑MoE comprises two key modules: the mixture of low‑rank token experts (MoLTE) and frequency‑aware filters (FAF). MoLTE leverages rank‑differentiated tokens to generate diverse feature adjustments for individual instances within multispectral images. By dynamically combining learnable low‑rank token experts of varying ranks, it enhances the robustness against spectral shifts. Meanwhile, FAF conducts frequency‑domain modulation on the refined features. This process enables the model to effectively capture frequency band information that is strongly correlated with semantic essence, while simultaneously suppressing frequency noise irrelevant to the task. Comprehensive experiments on MLCC tasks involving cross‑sensor and cross‑geospatial setups demonstrate that Land‑MoE outperforms existing methods by a large margin. Additionally, the proposed approach has also achieved state‑of‑the‑art performance in domain generalization semantic segmentation tasks of RGB remote sensing images.

Abstract:
Vision Mamba offers linear complexity for long visual sequences, yet its performance depends critically on how a two‑dimensional patch grid is serialized into a one‑dimensional state‑space recurrence. Raster‑style scans disrupt spatial continuity, and the mismatch between 2D locality and 1D state propagation becomes increasingly severe when the inference resolution grows beyond the training grid. This paper presents FractalMamba++, a resolution‑scalable vision backbone organized around a single geometric principle: the recursive self‑similar structure of the Hilbert curve determines how patches are serialized, where long‑range state shortcuts are inserted, and how positional relations are encoded. First, Hilbert‑curve‑based Fractal Serialization preserves local 2D neighborhoods more faithfully than linear scans and provides consistent neighborhood statistics across resolutions. Second, the Fractal Hierarchy Skip Connection (FHSC) derives a compact set of deterministic state‑injection routes from Hilbert recursion levels, mitigating long‑sequence information fading without runtime search, hand‑derived gradients, or dedicated CUDA kernels. Third, Fractal‑Aware 2D Rotary Position Encoding (FA‑RoPE) combines normalized 2D coordinates with a fractal hierarchy level so that feature interactions depend on actual spatial proximity and recursive structural role rather than serialized 1D distance. Extensive experiments on ImageNet‑1K classification, COCO detection and instance segmentation, ADE20K semantic segmentation, and LEVIR‑CD+ remote sensing change detection show that FractalMamba++ improves over existing Mamba‑based vision backbones, especially under high‑resolution inputs.
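
To make the serialization idea concrete, the sketch below orders the patches of a square grid along a Hilbert curve using the standard distance-to-coordinate conversion. It assumes a grid side that is a power of two and covers only the scan order itself; the function names are illustrative, and the skip connections (FHSC) and position encoding (FA‑RoPE) described above are not modeled.

```python
def hilbert_d2xy(n, d):
    """Map distance d along the Hilbert curve to (x, y) on an n x n grid.

    n must be a power of two. Standard iterative conversion.
    """
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            # Rotate or flip the current quadrant.
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_order(n):
    """Row-major patch indices of an n x n grid, listed in Hilbert-curve order."""
    order = []
    for d in range(n * n):
        x, y = hilbert_d2xy(n, d)
        order.append(y * n + x)
    return order


# Every consecutive pair of patches in this order is a direct grid neighbour,
# whereas a raster scan jumps across the whole row at each line end.
print(hilbert_order(4))
```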

Abstract:
Recent efforts have explored multimodal semantic segmentation using various backbone architectures. However, while most methods aim to improve accuracy, their computational efficiency remains underexplored. To address this, we propose EGFormer, an efficient multimodal semantic segmentation framework that flexibly integrates an arbitrary number of modalities while significantly reducing model parameters and inference time without sacrificing performance. Our framework introduces two novel modules. First, the Any‑modal Scoring Module (ASM) assigns importance scores to each modality independently, enabling dynamic ranking based on their feature maps. Second, the Modal Dropping Module (MDM) filters out less informative modalities at each stage, selectively preserving and aggregating only the most valuable features. This design allows the model to leverage useful information from all available modalities while discarding redundancy, thus ensuring high segmentation quality. In addition to efficiency, we evaluate EGFormer on a synthetic‑to‑real transfer task to demonstrate its generalizability. Extensive experiments show that EGFormer achieves competitive performance with up to 88 percent reduction in parameters and 50 percent fewer GFLOPs. Under unsupervised domain adaptation settings, it further achieves state‑of‑the‑art transfer performance compared to existing methods.

Abstract:
Supervised learning demands large amounts of precisely annotated data to achieve promising results. Such data curation is labor‑intensive and imposes significant overhead regarding time and costs. Self‑supervised learning (SSL) partially overcomes these limitations by exploiting vast amounts of unlabeled data and creating surrogate (pretext or proxy) tasks to learn useful representations without manual labeling. As a result, SSL has become a powerful machine learning (ML) paradigm for solving several practical downstream computer vision problems, such as classification, detection, and segmentation. Image segmentation is the cornerstone of many high‑level visual perception applications, including medical imaging, intelligent transportation, agriculture, and surveillance. Although there is substantial research potential for developing advanced algorithms for SSL‑based semantic segmentation, a comprehensive study of existing methodologies is essential to trace advances and guide emerging researchers. This survey thoroughly investigates over 150 recent image segmentation articles, particularly focusing on SSL. It provides a practical categorization of pretext tasks, downstream tasks, and commonly used benchmark datasets for image segmentation research. It concludes with key observations distilled from a large body of literature and offers future directions to make this research field more accessible and comprehensible for readers.

Abstract:
We propose FlowCut, a simple and capable method for unsupervised video instance segmentation consisting of a three‑stage framework to construct a high‑quality video dataset with pseudo labels. To our knowledge, our work is the first attempt to curate a video dataset with pseudo‑labels for unsupervised video instance segmentation. In the first stage, we generate pseudo‑instance masks by exploiting the affinities of features from both images and optical flows. In the second stage, we construct short video segments containing high‑quality, consistent pseudo‑instance masks by temporally matching them across the frames. In the third stage, we use the YouTubeVIS‑2021 video dataset to extract our training instance segmentation set, and then train a video segmentation model. FlowCut achieves state‑of‑the‑art performance on the YouTubeVIS‑2019, YouTubeVIS‑2021, DAVIS‑2017, and DAVIS‑2017 Motion benchmarks.

Abstract:
Pre‑training on real‑image datasets has been widely proven effective for improving instance segmentation. However, industrial applications face two key challenges: (1) legal and ethical restrictions, such as ImageNet's prohibition of commercial use, and (2) limited transferability due to the domain gap between web images and industrial imagery. Even recent vision foundation models, including the segment anything model (SAM), show notable performance degradation in industrial settings. These challenges raise critical questions: Can we build a vision foundation model for industrial applications without relying on real images or manual annotations? And can such models outperform even fine‑tuned SAM on industrial datasets? To address these questions, we propose the Instance Core Segmentation Dataset (InsCore), a synthetic pre‑training dataset based on formula‑driven supervised learning (FDSL). InsCore generates fully annotated instance segmentation images that reflect key characteristics of industrial data, including complex occlusions, dense hierarchical masks, and diverse non‑rigid shapes, distinct from typical web imagery. Unlike previous methods, InsCore requires neither real images nor human annotations. Experiments on five industrial datasets show that models pre‑trained with InsCore outperform those trained on COCO and ImageNet‑21k, as well as fine‑tuned SAM, achieving an average improvement of 6.2 points in instance segmentation performance. This result is achieved using only 100k synthetic images, more than 100 times fewer than the 11 million images in SAM's SA‑1B dataset, demonstrating the data efficiency of our approach. These findings position InsCore as a practical and license‑free vision foundation model for industrial applications.

Abstract:
Multi‑modal semantic segmentation (MMSS) faces significant challenges in real‑world applications due to incomplete, degraded, or missing sensor data. While current MMSS methods typically use self‑distillation with modality dropout to improve robustness, they largely overlook inter‑modal correlations and thus suffer significant performance degradation when no modalities are missing. To this end, we present RMMSS, a two‑stage framework designed to progressively enhance model robustness under missing‑modality conditions, while maintaining strong performance in full‑modality scenarios. It comprises two key components: the Hybrid Prototype Distillation Module (HPDM) and the Feature Selection Module (FSM). In the first stage, we pre‑train the teacher model with full‑modality data and then introduce HPDM to do cross‑modal knowledge distillation for obtaining a highly robust model. In the second stage, we freeze both the pre‑trained full‑modality teacher model and the robust model and propose a trainable FSM that extracts optimal representations from both the feature and logits layers of the models via feature score calculation. This process learns a final student model that maintains strong robustness while achieving high performance under full‑modality conditions. Our experiments on three datasets demonstrate that our method improves missing‑modality performance by 2.80%, 3.89%, and 0.89%, respectively, compared to the state‑of‑the‑art, while causing almost no drop in full‑modality performance (only ‑0.1% mIoU). Meanwhile, different backbones (AnySeg and CMNeXt) are utilized to validate the generalizability of our framework.

Abstract:
Transformers have transformed modern machine learning, driving breakthroughs in computer vision, natural language processing, and robotics. At the core of their success lies the attention mechanism, which enables the modeling of global dependencies among input tokens. However, we reveal that the attention block in transformers suffers from inherent ill‑conditioning, which hampers gradient‑based optimization and leads to inefficient training. To address this, we develop a theoretical framework that establishes a direct relationship between the conditioning of the attention block and that of the embedded tokenized data. Building on this insight, we introduce conditioned embedded tokens, a method that systematically modifies the embedded tokens to improve the conditioning of the attention mechanism. Our analysis demonstrates that this approach significantly mitigates ill‑conditioning, leading to more stable and efficient training. We validate our methodology across various transformer architectures, achieving consistent improvements in image classification, object detection, instance segmentation, and natural language processing, highlighting its broad applicability and effectiveness.

Abstract:
Segmentation evaluation metrics traditionally rely on binary decision logic: predictions are either correct or incorrect, based on rigid IoU thresholds. Detection‑based metrics such as F1 and mAP determine correctness at the object level using fixed overlap cutoffs, while overlap‑based metrics like Intersection over Union (IoU) and Dice operate at the pixel level, often overlooking instance‑level structure. Panoptic Quality (PQ) attempts to unify detection and segmentation assessment, but it remains dependent on hard‑threshold matching, treating predictions below the threshold as entirely incorrect. This binary framing obscures important distinctions between qualitatively different errors and fails to reward gradual model improvements. We propose SoftPQ, a flexible and interpretable instance segmentation metric that redefines evaluation as a graded continuum rather than a binary classification. SoftPQ introduces tunable upper and lower IoU thresholds to define a partial matching region and applies a sublinear penalty function to ambiguous or fragmented predictions. These extensions allow SoftPQ to exhibit smoother score behavior, greater robustness to structural segmentation errors, and more informative feedback for model development and evaluation. Through controlled perturbation experiments, we show that SoftPQ captures meaningful differences in segmentation quality that existing metrics overlook, making it a practical and principled alternative for both benchmarking and iterative model refinement.
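
The hard-threshold behavior criticized here can be seen in a short sketch of standard Panoptic Quality, PQ = (sum of IoUs over matched pairs) / (TP + 0.5 FP + 0.5 FN), where a pair counts as matched only if its IoU exceeds 0.5. The sketch assumes the candidate prediction/ground-truth pairs and their IoUs are already computed and one-to-one. It implements plain PQ only; the abstract does not give SoftPQ's penalty function, so SoftPQ is not reproduced.

```python
def panoptic_quality(pair_ious, n_pred, n_gt, threshold=0.5):
    """Standard PQ for one class from IoUs of candidate one-to-one pairs.

    pair_ious: IoU of each candidate (prediction, ground truth) pair.
    n_pred, n_gt: total number of predicted and ground-truth segments.
    """
    matched = [v for v in pair_ious if v > threshold]
    tp = len(matched)
    fp = n_pred - tp  # predictions without a match
    fn = n_gt - tp    # ground-truth segments without a match
    denom = tp + 0.5 * fp + 0.5 * fn
    return sum(matched) / denom if denom else 0.0


# Two objects, two predictions. A tiny IoU change across the threshold
# moves the second pair from "entirely wrong" to "matched".
print(panoptic_quality([0.9, 0.49], n_pred=2, n_gt=2))
print(panoptic_quality([0.9, 0.51], n_pred=2, n_gt=2))
```

The first call scores about 0.45 and the second about 0.705, although the second prediction improved by only 0.02 IoU; this discontinuity is the kind of effect a graded metric aims to smooth out.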

Abstract:
Remote sensing image (RSI) interpretation typically faces challenges due to the scarcity of labeled data, which limits the performance of RSI interpretation tasks. To tackle this challenge, we propose EarthSynth, a diffusion‑based generative foundation model that enables synthesizing multi‑category, cross‑satellite labeled Earth observation data for downstream RSI interpretation tasks. To the best of our knowledge, EarthSynth is the first to explore multi‑task generation for remote sensing, tackling the challenge of limited generalization in task‑oriented synthesis for RSI interpretation. EarthSynth, trained on the EarthSynth‑180K dataset, employs the Counterfactual Composition training strategy with a three‑dimensional batch‑sample selection mechanism to improve training data diversity and enhance category control. Furthermore, a rule‑based method, R‑Filter, is proposed to filter for more informative synthetic data for downstream tasks. We evaluate EarthSynth on scene classification, object detection, and semantic segmentation in open‑world scenarios. There are significant improvements in open‑vocabulary understanding tasks, offering a practical solution for advancing RSI interpretation.

Abstract:
We report on the application of a high‑capacity semantic segmentation pipeline to the GOOSE 2D Semantic Segmentation Challenge for unstructured off‑road environments. Using a FlashInternImage‑B backbone together with a UPerNet decoder, we adapt established techniques, rather than designing new ones, to the distinctive conditions of off‑road scenes. Our training recipe couples strong photometric distortion augmentation (to emulate the wide lighting variations of outdoor terrain) with an Exponential Moving Average (EMA) of weights for better generalization. Using only the GOOSE training dataset, we achieve 88.8% mIoU on the validation set.
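
The weight averaging mentioned in this report follows the usual exponential moving average rule, ema ← decay · ema + (1 − decay) · weights, applied after each training step. The sketch below shows that rule on a flat list of parameters; the decay value and the function name `ema_update` are illustrative assumptions, since the report's actual settings are not stated here.

```python
def ema_update(ema_weights, weights, decay=0.999):
    """One EMA step over a flat list of parameters.

    decay is an assumed value for illustration, not taken from the report.
    """
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


# Toy run: the raw weight oscillates around 1.0, the EMA copy moves smoothly.
ema = [0.0]
for step in range(200):
    raw = [1.0 + (0.2 if step % 2 == 0 else -0.2)]
    ema = ema_update(ema, raw, decay=0.9)
print(ema)
```

At evaluation time the averaged copy is used in place of the raw weights, which tends to damp step-to-step noise from training.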

Abstract:
Open‑vocabulary semantic segmentation aims to segment images into distinct semantic regions for both seen and unseen categories at the pixel level. Current methods utilize text embeddings from pre‑trained vision‑language models like CLIP but struggle with the inherent domain gap between image and text embeddings, even after extensive alignment during training. Additionally, relying solely on deep text‑aligned features limits shallow‑level feature guidance, which is crucial for detecting small objects and fine details, ultimately reducing segmentation accuracy. To address these limitations, we propose a dual prompting framework, DPSeg, for this task. Our approach combines dual‑prompt cost volume generation, a cost volume‑guided decoder, and a semantic‑guided prompt refinement strategy that leverages our dual prompting scheme to mitigate alignment issues in visual prompt generation. By incorporating visual embeddings from a visual prompt encoder, our approach reduces the domain gap between text and image embeddings while providing multi‑level guidance through shallow features. Extensive experiments demonstrate that our method significantly outperforms existing state‑of‑the‑art approaches on multiple public datasets.

Abstract:
Point‑cloud semantic segmentation underpins a wide range of critical applications. Although recent deep architectures and large‑scale datasets have driven impressive closed‑set performance, these models struggle to recognize or properly segment objects outside their training classes. This gap has sparked interest in Open‑Set Semantic Segmentation (O3S), where models must both correctly label known categories and detect novel, unseen classes. In this paper, we propose a plug‑and‑play framework for O3S. By modeling the segmentation pipeline as a conditional Markov chain, we derive a novel regularizer term, dubbed Conditional Channel Capacity Maximization (3CM), that maximizes the mutual information between features and predictions conditioned on each class. When incorporated into standard loss functions, 3CM encourages the encoder to retain richer, label‑dependent features, thereby enhancing the network's ability to distinguish and segment previously unseen categories. Experimental results demonstrate the effectiveness of the proposed method in detecting unseen objects. We further outline future directions for dynamic open‑world adaptation and efficient information‑theoretic estimation.

Abstract:
LiDAR‑based semantic segmentation plays a vital role in autonomous driving by enabling detailed understanding of 3D environments. However, annotating LiDAR point clouds is extremely costly and requires assigning semantic labels to millions of points with complex geometric structures. Active Learning (AL) has emerged as a promising approach to reduce labeling costs by querying only the most informative samples. Yet, existing AL methods face critical challenges when applied to large‑scale 3D data: outdoor scenes contain an overwhelming number of points and suffer from severe class imbalance, where rare classes have far fewer points than dominant classes. To address these issues, we propose SELECT, a voxel‑centric submodular approach tailored for active LiDAR semantic segmentation. Our method targets both scalability problems and class imbalance through three coordinated stages. First, we perform Voxel‑Level Submodular Subset Selection, which efficiently identifies representative voxels without pairwise comparisons, ensuring scalability. Second, we estimate Voxel‑Level Model Uncertainty using Monte Carlo dropout, aggregating point‑wise uncertainties to identify informative voxels. Finally, we introduce Submodular Maximization for Point‑Level Class Balancing, which selects a subset of points that enhances label diversity, explicitly mitigating class imbalance. Experiments on SemanticPOSS, SemanticKITTI, and nuScenes benchmarks demonstrate that SELECT achieves superior performance compared to prior active learning approaches for 3D semantic segmentation.

Abstract:
Accurate pose estimation of surgical tools in Robot‑assisted Minimally Invasive Surgery (RMIS) is essential for surgical navigation and robot control. While traditional marker‑based methods offer accuracy, they face challenges with occlusions, reflections, and tool‑specific designs. Similarly, supervised learning methods require extensive training on annotated datasets, limiting their adaptability to new tools. Despite their success in other domains, zero‑shot pose estimation models remain unexplored in RMIS for pose estimation of surgical instruments, creating a gap in generalising to unseen surgical tools. This paper presents a novel 6 Degrees of Freedom (DoF) pose estimation pipeline for surgical instruments, leveraging state‑of‑the‑art zero‑shot RGB‑D models such as FoundationPose and SAM‑6D. We advance these models by incorporating vision‑based depth estimation using the RAFT‑Stereo method, which provides robust depth in reflective and textureless environments. Additionally, we enhance SAM‑6D by replacing its instance segmentation module, the Segment Anything Model (SAM), with a fine‑tuned Mask R‑CNN, significantly boosting segmentation accuracy in occluded and complex conditions. Extensive validation reveals that our enhanced SAM‑6D surpasses FoundationPose in zero‑shot pose estimation of unseen surgical instruments, setting a new benchmark for zero‑shot RGB‑D pose estimation in RMIS. This work enhances the generalisability of pose estimation for unseen objects and pioneers the application of RGB‑D zero‑shot methods in RMIS.

Abstract:
Semi‑Supervised Instance Segmentation (SSIS) involves classifying and grouping image pixels into distinct object instances using limited labeled data. This learning paradigm usually faces a significant challenge of unstable performance caused by noisy pseudo‑labels of instance categories and pixel masks. We find that the prevalent practice of filtering instance pseudo‑labels with a single score threshold that assesses both class and mask quality frequently forces a compromise between the quality of class labels and that of mask labels. In this paper, we introduce a novel Pseudo‑Label Quality Decoupling and Correction (PL‑DC) framework for SSIS to tackle the above challenges. Firstly, at the instance level, a dual‑threshold filtering mechanism is designed to decouple class and mask quality estimations for instance‑level pseudo‑labels, thereby independently controlling pixel classifying and grouping qualities. Secondly, at the category level, we introduce a dynamic instance category correction module to correct the pseudo‑labels of instance categories, effectively alleviating category confusion. Lastly, at the pixel level, we introduce a mask uncertainty‑aware mechanism to re‑weight the mask loss for different pixels, thereby reducing the impact of noise introduced by pixel‑level mask pseudo‑labels. Extensive experiments on the COCO and Cityscapes datasets demonstrate that the proposed PL‑DC achieves significant performance improvements, setting new state‑of‑the‑art results for SSIS. Notably, our PL‑DC shows substantial gains even with minimal labeled data, achieving an improvement of +11.6 mAP with just 1% COCO labeled data and +15.5 mAP with 5% Cityscapes labeled data. The code will be public.

Abstract:
This work addresses the task of completely weakly supervised class‑incremental learning for semantic segmentation to learn segmentation for both base and additional novel classes using only image‑level labels. While class‑incremental semantic segmentation (CISS) is crucial for handling diverse and newly emerging objects in the real world, traditional CISS methods require expensive pixel‑level annotations for training. To overcome this limitation, partially weakly‑supervised approaches have recently been proposed. However, to the best of our knowledge, this is the first work to introduce a completely weakly‑supervised method for CISS. To achieve this, we propose to generate robust pseudo‑labels by combining pseudo‑labels from a localizer and a sequence of foundation models based on their uncertainty. Moreover, to mitigate catastrophic forgetting, we introduce an exemplar‑guided data augmentation method that generates diverse images containing both previous and novel classes with guidance. Finally, we conduct experiments in three common experimental settings: 15‑5 VOC, 10‑10 VOC, and COCO‑to‑VOC, and in two scenarios: disjoint and overlap. The experimental results demonstrate that our completely weakly supervised method outperforms even partially weakly supervised methods in the 15‑5 VOC and 10‑10 VOC settings while achieving competitive accuracy in the COCO‑to‑VOC setting.

Abstract:
We introduce Infinigen‑Articulated, a toolkit for procedurally generating realistic articulated assets for robotics simulation. We include procedural generators for 18 common articulated object categories along with high‑level utilities for creating custom articulated assets in Blender. We also provide an export pipeline to integrate the resulting assets along with their physical properties into common robotics simulators. Experiments demonstrate that assets sampled from these generators are effective for movable object segmentation, training generalizable reinforcement learning policies, and sim‑to‑real transfer of imitation learning policies.

Abstract:
Although the use of remote sensing technologies for monitoring forested environments has gained increasing attention, publicly available point cloud datasets remain scarce due to the high costs, sensor requirements, and time‑intensive nature of their acquisition. Moreover, as far as we are aware, there are no public annotated datasets generated through Structure From Motion (SfM) algorithms applied to imagery, which may be due to the lack of SfM algorithms that can map semantic segmentation information into an accurate point cloud, especially in a challenging environment like forests. In this work, we present a novel pipeline for generating semantically segmented point clouds of forest environments. Using a custom‑built forest simulator, we generate realistic RGB images of diverse forest scenes along with their corresponding semantic segmentation masks. These labeled images are then processed using modified open‑source SfM software capable of preserving semantic information during 3D reconstruction. The resulting point clouds provide both geometric and semantic detail, offering a valuable resource for training and evaluating deep learning models aimed at segmenting real forest point clouds obtained via SfM.

Abstract:
Accurate food volume estimation is crucial for medical nutrition management and health monitoring applications, but current food volume estimation methods are often limited by monocular data, leveraging single‑purpose hardware such as 3D scanners, gathering sensor‑oriented information such as depth information, or relying on camera calibration using a reference object. In this paper, we present VolE, a novel framework that leverages mobile device‑driven 3D reconstruction to estimate food volume. VolE captures images and camera locations in free motion to generate precise 3D models, thanks to AR‑capable mobile devices. To achieve real‑world measurement, VolE is a reference‑ and depth‑free framework that leverages food video segmentation for food mask generation. We also introduce a new food dataset encompassing challenging scenarios absent from previous benchmarks. Our experiments demonstrate that VolE outperforms existing volume estimation techniques across multiple datasets by achieving 2.22% MAPE, highlighting its superior performance in food volume estimation.

Abstract:
Federated semantic segmentation enables pixel‑level classification in images through collaborative learning while maintaining data privacy. However, existing research commonly overlooks the fine‑grained class relationships within the semantic space when addressing heterogeneous problems, particularly domain shift. This oversight results in ambiguities between class representations. To overcome this challenge, we propose a novel federated segmentation framework that enforces class consistency, termed FedSaaS. Specifically, we introduce class exemplars as a criterion for both local‑ and global‑level class representations. On the server side, the uploaded class exemplars are leveraged to model class prototypes, which supervise the global branch of clients, ensuring alignment with the global‑level representation. On the client side, we incorporate an adversarial mechanism to harmonize the contributions of the global and local branches, leading to consistent output. Moreover, multilevel contrastive losses are employed on both sides to enforce consistency between the two‑level representations in the same semantic space. Extensive experiments on several driving scene segmentation datasets demonstrate that our framework outperforms state‑of‑the‑art methods, significantly improving average segmentation accuracy and effectively addressing the class‑consistency representation problem.

Abstract:
This paper presents a Multi‑Elevation Semantic Segmentation Image (MESSI) dataset comprising 2525 images taken by a drone flying over dense urban environments. MESSI is unique in two main features. First, it contains images from various altitudes, allowing us to investigate the effect of depth on semantic segmentation. Second, it includes images taken from several different urban regions (at different altitudes). This is important since the variety covers the visual richness captured by a drone's 3D flight, performing horizontal and vertical maneuvers. MESSI contains images annotated with location, orientation, and the camera's intrinsic parameters and can be used to train a deep neural network for semantic segmentation or other applications of interest (e.g., localization, navigation, and tracking). This paper describes the dataset and provides annotation details. It also explains how semantic segmentation was performed using several neural network models and shows several relevant statistics. MESSI will be published in the public domain to serve as an evaluation benchmark for semantic segmentation using images captured by a drone or similar vehicle flying over a dense urban environment.

Abstract:
This research investigates the application of computer vision for rapid, accurate, and non‑invasive food quality assessment, focusing on the novel challenge of real‑time raspberry grading into five distinct classes within an industrial environment as the fruits move along a conveyor belt. To address this, a dedicated dataset of raspberries, namely RaspGrade, was acquired and meticulously annotated. Instance segmentation experiments revealed that accurate fruit‑level masks can be obtained; however, the classification of certain raspberry grades presents challenges due to color similarities and occlusion, while others are more readily distinguishable based on color. The acquired and annotated RaspGrade dataset is accessible on Hugging Face at: https://huggingface.co/datasets/FBK‑TeV/RaspGrade.

Abstract:
Accurate segmentation of tubular topological structures (e.g., fissures and vasculature) is critical in various fields to guarantee dependable downstream quantitative analysis and modeling. However, in dense prediction tasks such as semantic segmentation and super‑resolution, conventional upsampling operators cannot accommodate the slenderness of tubular structures and the curvature of their morphology. This paper introduces a dynamic snake upsampling operator and a boundary‑skeleton weighted loss tailored for topological tubular structures. Specifically, we design a snake upsampling operator based on an adaptive sampling domain, which dynamically adjusts the sampling stride according to the feature map and selects a set of subpixel sampling points along the serpentine path, enabling more accurate subpixel‑level feature recovery for tubular structures. Meanwhile, we propose a skeleton‑to‑boundary increasing weighted loss that trades off main body and boundary weight allocation based on mask class ratio and distance field, preserving main body overlap while enhancing focus on target topological continuity and boundary alignment precision. Experiments across various domain datasets and backbone networks show that this plug‑and‑play dynamic snake upsampling operator and boundary‑skeleton weighted loss boost both pixel‑wise segmentation accuracy and topological consistency of results.

Abstract:
Large language models achieve high task performance yet often hallucinate or rely on outdated knowledge. Retrieval‑augmented generation (RAG) addresses these gaps by coupling generation with external search. We analyse how hyperparameters influence speed and quality in RAG systems, covering Chroma and Faiss vector stores, chunking policies, cross‑encoder re‑ranking, and temperature, and we evaluate six metrics: faithfulness, answer correctness, answer relevancy, context precision, context recall, and answer similarity. Chroma processes queries 13% faster, whereas Faiss yields higher retrieval precision, revealing a clear speed‑accuracy trade‑off. Naive fixed‑length chunking with small windows and minimal overlap outperforms semantic segmentation while remaining the quickest option. Re‑ranking provides modest gains in retrieval quality yet increases runtime by roughly a factor of 5, so its usefulness depends on latency constraints. These results help practitioners balance computational cost and accuracy when tuning RAG systems for transparent, up‑to‑date responses. Finally, we re‑evaluate the top configurations with a corrective RAG workflow and show that their advantages persist when the model can iteratively request additional evidence. We obtain a near‑perfect context precision (99%), which demonstrates that RAG systems can achieve extremely high retrieval accuracy with the right combination of hyperparameters, with significant implications for applications where retrieval quality directly impacts downstream task performance, such as clinical decision support in healthcare.
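
As a rough illustration of the naive fixed‑length chunking policy mentioned above, the sketch below splits a token list into windows that share a small overlap. The function name `chunk_fixed`, the window size, and the overlap are hypothetical choices for illustration only; the abstract does not state the settings that were actually evaluated.

```python
from typing import List


def chunk_fixed(tokens: List[str], window: int, overlap: int) -> List[List[str]]:
    """Split tokens into fixed-length windows that share `overlap` tokens."""
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# Hypothetical example text; window and overlap are illustrative values.
words = "retrieval augmented generation couples a generator with external search".split()
chunks = chunk_fixed(words, window=4, overlap=1)
```

With these illustrative values, consecutive chunks share exactly one token, which is the "minimal overlap" idea in its simplest form.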

Abstract:
Tertiary lymphoid structures (TLS) are organized clusters of immune cells, whose maturity and area can be quantified in whole slide image (WSI) for various prognostic tasks. Existing methods for assessing these characteristics typically rely on cell proxy tasks and require additional post‑processing steps. In this work, we focus on a novel task, TLS Semantic Segmentation (TLS‑SS), which segments both the regions and maturation stages of TLS in WSI in an end‑to‑end manner. Due to the extensive scale of WSI and patch‑based segmentation strategies, TLS‑SS necessitates integrating context from neighboring patches to guide the segmentation of the target patch (target). Previous techniques often employ multi‑resolution approaches, which constrain the capacity to leverage the broader neighboring context while tending to preserve coarse‑grained information. To address this, we propose a GNN‑based Neighboring Context Aggregation Framework (GNCAF), which progressively aggregates multi‑hop neighboring context from the target and employs a self‑attention mechanism to guide the segmentation of the target. GNCAF can be integrated with various segmentation models to enhance their ability to perceive contextual information outside of the patch. We build two TLS‑SS datasets, called TCGA‑COAD and INHOUSE‑PAAD, and make the former (comprising 225 WSIs and 5041 TLSs) publicly available. Experiments on these datasets demonstrate the superiority of GNCAF, achieving a maximum of 22.08% and 26.57% improvement in mF1 and mIoU, respectively. We also validate the task scalability of GNCAF on segmentation of lymph node metastases.

Abstract:
Semantic anomalies are contextually invalid or unusual combinations of familiar visual elements that can cause undefined behavior and failures in system‑level reasoning for autonomous systems. This work explores semantic anomaly detection by leveraging the semantic priors of state‑of‑the‑art vision foundation models, operating directly on the image. We propose a framework that compares local vision embeddings from runtime images to a database of nominal scenarios in which the autonomous system is deemed safe and performant. In this work, we consider two variants of the proposed framework: one using raw grid‑based embeddings, and another leveraging instance segmentation for object‑centric representations. To further improve robustness, we introduce a simple filtering mechanism to suppress false positives. Our evaluations on CARLA‑simulated anomalies show that the instance‑based method with filtering achieves performance comparable to GPT‑4o, while providing precise anomaly localization. These results highlight the potential utility of vision embeddings from foundation models for real‑time anomaly detection in autonomous systems.

Abstract:
User privacy is a crucial concern in robotic applications, especially when mobile service robots are deployed in personal or sensitive environments. However, many robotic downstream tasks require the use of cameras, which may raise privacy risks. To better understand user perceptions of privacy in relation to visual data, we conducted a user study investigating how different image modalities and image resolutions affect users' privacy concerns. The results show that depth images are broadly viewed as privacy‑safe, and a similarly high proportion of respondents feel the same about semantic segmentation images. Additionally, the majority of participants consider 32×32 resolution RGB images to be almost sufficiently privacy‑preserving, while most believe that 16×16 resolution can fully guarantee privacy protection.

Abstract:
Semi‑supervised learning leverages unlabeled data to enhance model performance, addressing the limitations of fully supervised approaches. Among its strategies, pseudo‑supervision has proven highly effective, typically relying on one or multiple teacher networks to refine pseudo‑labels before training a student network. A common practice in pseudo‑supervision is filtering pseudo‑labels based on pre‑defined confidence thresholds or entropy. However, selecting optimal thresholds requires large labeled datasets, which are often scarce in real‑world semi‑supervised scenarios. To overcome this challenge, we propose Ensemble‑of‑Confidence Reinforcement (ENCORE), a dynamic feedback‑driven thresholding strategy for pseudo‑label selection. Instead of relying on static confidence thresholds, ENCORE estimates class‑wise true‑positive confidence within the unlabeled dataset and continuously adjusts thresholds based on the model's response to different levels of pseudo‑label filtering. This feedback‑driven mechanism ensures the retention of informative pseudo‑labels while filtering unreliable ones, enhancing model training without manual threshold tuning. Our method seamlessly integrates into existing pseudo‑supervision frameworks and significantly improves segmentation performance, particularly in data‑scarce conditions. Extensive experiments demonstrate that integrating ENCORE with existing pseudo‑supervision frameworks enhances performance across multiple datasets and network architectures, validating its effectiveness in semi‑supervised learning.

Authors: Olaf Wysocki, Benedikt Schwab, Manoj Kumar Biswanath, Michael Greza, Qilin Zhang, Jingwei Zhu, Thomas Froech, Medhini Heeramaglore, Ihab Hijazi, Khaoula Kanna, Mathias Pechinger, Zhaiyu Chen, Yao Sun, Alejandro Rueda Segura, Ziyang Xu, Omar AbdelGafar, Mansour Mehranfar, Chandan Yeshwanth, Yueh-Cheng Liu, Hadi Yazdi, Jiapan Wang, Stefan Auer, Katharina Anders, Klaus Bogenberger, Andre Borrmann, Angela Dai, Ludwig Hoegner, Christoph Holst, Thomas H. Kolbe, Ferdinand Ludwig, Matthias Nießner, Frank Petzold, Xiao Xiang Zhu, Boris Jutzi

Abstract:
Urban Digital Twins (UDTs) have become essential for managing cities and integrating complex, heterogeneous data from diverse sources. Creating UDTs involves challenges at multiple process stages, including acquiring accurate 3D source data, reconstructing high‑fidelity 3D models, keeping models up to date, and ensuring seamless interoperability with downstream tasks. Current datasets are usually limited to one part of the processing chain, hampering comprehensive UDT validation. To address these challenges, we introduce the first comprehensive multimodal Urban Digital Twin benchmark dataset: TUM2TWIN. This dataset includes georeferenced, semantically aligned 3D models and networks along with various terrestrial, mobile, aerial, and satellite observations boasting 32 data subsets over roughly 100,000 m^2 and currently 767 GB of data. By ensuring georeferenced indoor‑outdoor acquisition, high accuracy, and multimodal data integration, the benchmark supports robust analysis of sensors and the development of advanced reconstruction methods. Additionally, we explore downstream tasks demonstrating the potential of TUM2TWIN, including novel view synthesis of NeRF and Gaussian Splatting, solar potential analysis, point cloud semantic segmentation, and LoD3 building reconstruction. We are convinced this contribution lays a foundation for overcoming current limitations in UDT creation, fostering new research directions and practical solutions for smarter, data‑driven urban environments. The project is available under: https://tum2t.win

Abstract:
Unsupervised Domain Adaptation (UDA) aims to align source and target domain distributions to close the domain gap, but still struggles with obtaining the target data. Fortunately, Domain Generalization (DG) excels without the need for any target data. Recent works show that depth maps contribute to improved generalization performance in UDA tasks, but they ignore the noise and holes in depth maps caused by device and environmental factors, and therefore fail to learn domain‑invariant representations sufficiently and effectively. Although high‑sensitivity region suppression has shown promising results in learning domain‑invariant features, existing methods cannot be directly applied to depth maps due to their unique characteristics. Hence, we propose a novel framework, namely Depth‑Sensitive Soft Suppression with RGB‑D inter‑modal stylization flow (DSSS), focusing on learning domain‑invariant features from depth maps for DG semantic segmentation. Specifically, we propose the RGB‑D inter‑modal stylization flow to generate stylized depth maps for sensitivity detection, cleverly utilizing RGB information as the stylization source. Then, a class‑wise soft spatial sensitivity suppression is designed to identify and emphasize non‑sensitive depth features that contain more domain‑invariant information. Furthermore, an RGB‑D soft alignment loss is proposed to ensure that the stylized depth maps only align part of the RGB features while still retaining the unique depth information. To the best of our knowledge, our DSSS framework is the first work to integrate RGB and depth information in the multi‑class DG semantic segmentation task. Extensive experiments over multiple backbone networks show that our framework achieves remarkable performance improvement.

Abstract:
This report presents our semantic segmentation framework developed by team ACVLAB for the ICRA 2025 GOOSE 2D Semantic Segmentation Challenge, which focuses on parsing outdoor scenes into nine semantic categories under real‑world conditions. Our method integrates a Swin Transformer backbone enhanced with Rotary Position Embedding (RoPE) for improved spatial generalization, alongside a Color Shift Estimation‑and‑Correction module designed to compensate for illumination inconsistencies in natural environments. To further improve training stability, we adopt a quantile‑based denoising strategy that downweights the top 2.5% of highest‑error pixels, treating them as noise and suppressing their influence during optimization. Evaluated on the official GOOSE test set, our approach achieved a mean Intersection over Union (mIoU) of 0.848, demonstrating the effectiveness of combining color correction, positional encoding, and error‑aware denoising in robust semantic segmentation.
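
To make the quantile‑based denoising idea above more concrete, here is a minimal sketch that gives a reduced weight to the highest‑loss pixels before averaging. The function names, the use of a weight of 0.0 for the suppressed pixels, and the toy loss values are all assumptions for illustration; the abstract only states that the top 2.5% of highest‑error pixels are downweighted, not how.

```python
from typing import List


def quantile_weights(losses: List[float], top_frac: float = 0.025,
                     low_weight: float = 0.0) -> List[float]:
    """Return per-pixel weights: `low_weight` for the highest-loss pixels, 1.0 otherwise."""
    n = len(losses)
    k = int(n * top_frac)  # number of pixels treated as noise
    if k == 0:
        return [1.0] * n
    # Indices of the k largest per-pixel losses.
    noisy = set(sorted(range(n), key=lambda i: losses[i], reverse=True)[:k])
    return [low_weight if i in noisy else 1.0 for i in range(n)]


def weighted_mean_loss(losses: List[float], weights: List[float]) -> float:
    """Weighted average of per-pixel losses (0.0 if all weights are zero)."""
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0
    return sum(l * w for l, w in zip(losses, weights)) / total_weight


# Toy example: 40 pixels, one of which has an outlying loss.
losses = [0.1] * 39 + [5.0]
weights = quantile_weights(losses)
robust_loss = weighted_mean_loss(losses, weights)
```

In this toy case the single outlier falls in the top 2.5% and no longer influences the averaged loss; a nonzero `low_weight` would soften rather than remove its contribution.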

Abstract:
In autonomous driving, thermal image semantic segmentation has emerged as a critical research area, owing to its ability to provide robust scene understanding under adverse visual conditions. In particular, unsupervised domain adaptation (UDA) for thermal image segmentation can be an efficient solution to address the lack of labeled thermal datasets. Nevertheless, since these methods do not effectively utilize the complementary information between RGB and thermal images, their performance degrades significantly during domain adaptation. In this paper, we present a comprehensive study on cross‑spectral UDA for thermal image semantic segmentation. We first propose a novel masked mutual learning strategy that promotes complementary information exchange by selectively transferring results between each spectral model while masking out uncertain regions. Additionally, we introduce a novel prototypical self‑supervised loss designed to enhance the performance of the thermal segmentation model in nighttime scenarios. This approach addresses the limitations of RGB pre‑trained networks, which cannot effectively transfer knowledge under low illumination due to the inherent constraints of RGB sensors. In experiments, our method achieves higher performance than previous UDA methods and comparable performance to state‑of‑the‑art supervised methods.

Abstract:
Fusing and balancing multi‑modal inputs from novel sensors for dense prediction tasks, particularly semantic segmentation, is critically important yet remains a significant challenge. One major limitation is the tendency of multi‑modal frameworks to over‑rely on easily learnable modalities, a phenomenon referred to as unimodal dominance or bias. This issue becomes especially problematic in real‑world scenarios where the dominant modality may be unavailable, resulting in severe performance degradation. To this end, we apply a simple but effective plug‑and‑play regularization term based on functional entropy, which introduces no additional parameters or modules. This term is designed to intuitively balance the contribution of each visual modality to the segmentation results. Specifically, we leverage the log‑Sobolev inequality to bound functional entropy using functional‑Fisher‑information. By maximizing the information contributed by each visual modality, our approach mitigates unimodal dominance and establishes a more balanced and robust segmentation framework. A multi‑scale regularization module is proposed to apply our proposed plug‑and‑play term on high‑level features and also segmentation predictions for more balanced multi‑modal learning. Extensive experiments on three datasets demonstrate that our proposed method achieves superior performance, i.e., +13.94%, +3.25%, and +3.64%, without introducing any additional parameters.

Abstract:
Bird's‑Eye‑View (BEV) semantic segmentation provides comprehensive environmental perception for autonomous driving but suffers from multi‑modal misalignment and sensor noise. We propose RESAR‑BEV, a progressive refinement framework that advances beyond single‑step end‑to‑end approaches: (1) progressive refinement through residual autoregressive learning that decomposes BEV segmentation into interpretable coarse‑to‑fine stages via our Drive‑Transformer and Modifier‑Transformer residual prediction cascaded architecture, (2) robust BEV representation combining ground‑proximity voxels with adaptive height offsets and dual‑path voxel feature encoding (max+attention pooling) for efficient feature extraction, and (3) decoupled supervision with offline Ground Truth decomposition and online joint optimization to prevent overfitting while ensuring structural coherence. Experiments on nuScenes demonstrate that RESAR‑BEV achieves state‑of‑the‑art performance with 54.0% mIoU across 7 essential driving‑scene categories while maintaining real‑time capability at 14.6 FPS. The framework exhibits robustness in challenging scenarios of long‑range perception and adverse weather conditions.

Abstract:
This paper proposes MTL‑Swin‑Unet, a transformer‑based multi‑task learning method for classification and semantic segmentation. To address spurious‑correlation problems, the method enhances the image representation with two other image representations: one obtained by semantic segmentation and one obtained by image reconstruction. In our experiments, the proposed method outperformed other classifiers in F‑value when the test data included slices from the same patient (no covariate shift). Similarly, when the test data did not include slices from the same patient (covariate shift setting), the proposed method outperformed them in AUC.

Abstract:
Pathological diagnosis is the gold standard for tumor diagnosis, and nucleus instance segmentation is a key step in digital pathology analysis and pathological diagnosis. However, the computational efficiency of the model and the treatment of overlapping targets are major challenges in studies of this problem. To this end, a neural network model, RepSNet, was designed based on nucleus boundary regression and a structural re‑parameterization scheme for segmenting and classifying the nuclei in H&E‑stained histopathological images. First, RepSNet estimates the boundary position information (BPI) of the parent nucleus for each pixel. The BPI estimation incorporates the local information of the pixel and the contextual information of the parent nucleus. Then, the nucleus boundary is estimated by aggregating the BPIs from a series of pixels using a proposed boundary voting mechanism (BVM), and the instance segmentation results are computed from the estimated nucleus boundary using a connected component analysis procedure. The BVM intrinsically achieves a kind of synergistic belief enhancement among the BPIs from various pixels. Therefore, unlike methods in the literature that obtain nucleus boundaries based on a direct pixel recognition scheme, RepSNet computes its boundary decisions based on guidance from macroscopic information using an integration mechanism. In addition, RepSNet employs a re‑parameterizable encoder‑decoder structure. This model can not only aggregate features from receptive fields of various scales, which helps improve segmentation accuracy, but also reduce the parameter count and computational burden in the model inference phase through the structural re‑parameterization technique. Extensive experiments demonstrated the superiority of RepSNet compared to several typical benchmark models.

Abstract:
The introduction of the Segment Anything Model (SAM) has paved the way for numerous semantic segmentation applications. For several tasks, quantifying the uncertainty of SAM is of particular interest. However, the ambiguous nature of the class‑agnostic foundation model SAM challenges current uncertainty quantification (UQ) approaches. This paper presents a theoretically motivated uncertainty quantification model based on a Bayesian entropy formulation jointly respecting aleatoric, epistemic, and the newly introduced task uncertainty. We use this formulation to train USAM, a lightweight post‑hoc UQ method. Our model traces the root of uncertainty back to under‑parameterised models, insufficient prompts or image ambiguities. Our proposed deterministic USAM demonstrates superior predictive capabilities on the SA‑V, MOSE, ADE20k, DAVIS, and COCO datasets, offering a computationally cheap and easy‑to‑use UQ alternative that can support user‑prompting, enhance semi‑supervised pipelines, or balance the tradeoff between accuracy and cost efficiency.

Abstract:
Zero‑shot Semantic Segmentation (ZSS) aims to segment categories that are not annotated during training. While fine‑tuning vision‑language models has achieved promising results, these models often overfit to seen categories due to the lack of supervision for unseen classes. As an alternative to fully supervised approaches, query‑based segmentation has shown great potential in ZSS, as it enables object localization without relying on explicit labels. However, conventional Hungarian matching, a core component in query‑based frameworks, needs full supervision and often misclassifies unseen categories as background in the ZSS setting. To address this issue, we propose Split Matching (SM), a novel assignment strategy that decouples Hungarian matching into two components: one for seen classes in annotated regions and another for latent classes in unannotated regions (referred to as unseen candidates). Specifically, we partition the queries into seen and candidate groups, enabling each to be optimized independently according to its available supervision. To discover unseen candidates, we cluster CLIP dense features to generate pseudo masks and extract region‑level embeddings using CLS tokens. Matching is then conducted separately for the two groups based on both class‑level similarity and mask‑level consistency. Additionally, we introduce a Multi‑scale Feature Enhancement (MFE) module that refines decoder features through residual multi‑scale aggregation, improving the model's ability to capture spatial details across resolutions. SM is the first to introduce decoupled Hungarian matching under the inductive ZSS setting, and achieves state‑of‑the‑art performance on two standard benchmarks.

Abstract:
The Segment Anything Model (SAM) is a popular vision foundation model; however, its high computational and memory demands make deployment on resource‑constrained devices challenging. While Post‑Training Quantization (PTQ) is a practical approach for reducing computational overhead, existing PTQ methods rely on fixed bit‑width quantization, leading to suboptimal accuracy and efficiency. To address this limitation, we propose Mix‑QSAM, a mixed‑precision PTQ framework for SAM. First, we introduce a layer‑wise importance score, derived using Kullback‑Leibler (KL) divergence, to quantify each layer's contribution to the model's output. Second, we introduce cross‑layer synergy, a novel metric based on causal mutual information, to capture dependencies between adjacent layers. This ensures that highly interdependent layers maintain similar bit‑widths, preventing abrupt precision mismatches that degrade feature propagation and numerical stability. Using these metrics, we formulate an Integer Quadratic Programming (IQP) problem to determine optimal bit‑width allocation under model size and bit‑operation constraints, assigning higher precision to critical layers while minimizing bit‑width in less influential layers. Experimental results demonstrate that Mix‑QSAM consistently outperforms existing PTQ methods on instance segmentation and object detection tasks, achieving up to 20% higher average precision under 6‑bit and 4‑bit mixed‑precision settings, while maintaining computational efficiency.

Abstract:
This study addresses the inherent limitations of Multi‑Layer Perceptrons (MLPs) in Vision Transformers (ViTs) by introducing Hybrid Kolmogorov‑Arnold Network (KAN)‑ViT (Hyb‑KAN ViT), a novel framework that integrates wavelet‑based spectral decomposition and spline‑optimized activation functions. Prior work has failed to focus on the prebuilt modularity of the ViT architecture and the integration of the edge detection capabilities of wavelet functions. We propose two key modules: Efficient‑KAN (Eff‑KAN), which replaces MLP layers with spline functions, and Wavelet‑KAN (Wav‑KAN), which leverages orthogonal wavelet transforms for multi‑resolution feature extraction. These modules are systematically integrated in ViT encoder layers and classification heads to enhance spatial‑frequency modeling while mitigating computational bottlenecks. Experiments on ImageNet‑1K (Image Recognition), COCO (Object Detection and Instance Segmentation), and ADE20K (Semantic Segmentation) demonstrate state‑of‑the‑art performance with Hyb‑KAN ViT. Ablation studies validate the efficacy of wavelet‑driven spectral priors in segmentation and spline‑based efficiency in detection tasks. The framework establishes a new paradigm for balancing parameter efficiency and multi‑scale representation in vision architectures.

Abstract:
Image segmentation is a powerful computer vision technique for scene understanding. However, real‑world deployment is stymied by the need for high‑quality, meticulously labeled datasets. Synthetic data provides high‑quality labels while reducing the need for manual data collection and annotation. However, deep neural networks trained on synthetic data often face the Syn2Real problem, leading to poor performance in real‑world deployments. To mitigate the aforementioned gap in image segmentation, we propose RAFT, a novel framework for adapting image segmentation models using minimal labeled real‑world data through data and feature augmentations, as well as active learning. To validate RAFT, we perform experiments on the synthetic‑to‑real "SYNTHIA‑>Cityscapes" and "GTAV‑>Cityscapes" benchmarks. We managed to surpass the previous state of the art, HALO. SYNTHIA‑>Cityscapes experiences an improvement in mIoU upon domain adaptation of 2.1%/79.9%, and GTAV‑>Cityscapes experiences a 0.4%/78.2% improvement in mIoU. Furthermore, we test our approach on the real‑to‑real benchmark of "Cityscapes‑>ACDC", and again surpass HALO, with a gain in mIoU upon adaptation of 1.3%/73.2%. Finally, we examine the effect of the allocated annotation budget and various components of RAFT upon the final transfer mIoU.

Abstract:
We propose MFSeg, an efficient multi‑frame 3D semantic segmentation framework. By aggregating point cloud sequences at the feature level and regularizing the feature extraction and aggregation process, MFSeg reduces computational overhead while maintaining high accuracy. Moreover, by employing a lightweight MLP‑based point decoder, our method eliminates the need to upsample redundant points from past frames. Experiments on the nuScenes and Waymo datasets show that MFSeg outperforms existing methods, demonstrating its effectiveness and efficiency.

Abstract:
Automating leaf manipulation in agricultural settings faces significant challenges, including the variability of plant morphologies and deformable leaves. We propose a novel hybrid geometric‑neural approach for autonomous leaf grasping that combines traditional computer vision with neural networks through self‑supervised learning. Our method integrates YOLOv8 for instance segmentation and RAFT‑Stereo for 3D depth estimation to build rich leaf representations, which feed into both a geometric feature scoring pipeline and a neural refinement module (GraspPointCNN). The key innovation is our confidence‑weighted fusion mechanism that dynamically balances the contribution of each approach based on prediction certainty. Our self‑supervised framework uses the geometric pipeline as an expert teacher to automatically generate training data. Experiments demonstrate that our approach achieves an 88.0% success rate in controlled environments and 84.7% in real greenhouse conditions, significantly outperforming both purely geometric (75.3%) and neural (60.2%) methods. This work establishes a new paradigm for agricultural robotics where domain expertise is seamlessly integrated with machine learning capabilities, providing a foundation for fully automated crop monitoring systems.

Abstract:
Segmenting objects in an environment is a crucial task for autonomous driving and robotics, as it enables a better understanding of the surroundings of each agent. Although camera sensors provide rich visual details, they are vulnerable to adverse weather conditions. In contrast, radar sensors remain robust under such conditions, but often produce sparse and noisy data. Therefore, a promising approach is to fuse information from both sensors. In this work, we propose a novel framework to enhance camera‑only baselines by integrating a diffusion model into a camera‑radar fusion architecture. We leverage radar point features to create pseudo‑masks using the Segment‑Anything model, treating the projected radar points as point prompts. Additionally, we propose a noise reduction unit to denoise these pseudo‑masks, which are further used to generate inpainted images that complete the missing information in the original images. Our method improves the camera‑only segmentation baseline by 2.63% in mIoU and enhances our camera‑radar fusion architecture by 1.48% in mIoU on the Waterscenes dataset. This demonstrates the effectiveness of our approach for semantic segmentation using camera‑radar fusion under adverse weather conditions.

Abstract:
Semantic segmentation of 3D LiDAR point clouds, essential for autonomous driving and infrastructure management, is best achieved by supervised learning, which demands extensive annotated datasets and faces the problem of domain shifts. We introduce a new 3D semantic segmentation pipeline that leverages aligned scenes and state‑of‑the‑art 2D segmentation methods, avoiding the need for direct 3D annotation or reliance on additional modalities such as camera images at inference time. Our approach generates 2D views from LiDAR scans colored by sensor intensity and applies 2D semantic segmentation to these views using a camera‑domain pretrained model. The segmented 2D outputs are then back‑projected onto the 3D points, with a simple voting‑based estimator that merges the labels associated with each 3D point. Our main contribution is a global pipeline for 3D semantic segmentation that requires no prior 3D annotation and no other modality for inference, and which can be used for pseudo‑label generation. We conduct a thorough ablation study and demonstrate the potential of the generated pseudo‑labels for the Unsupervised Domain Adaptation task.
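
The voting step described above can be sketched as a plain majority vote over the labels that different 2D views assign to the same 3D point. This is only one plausible reading of "a simple voting‑based estimator": the function name `merge_labels_by_vote`, the `(point_id, label)` input format, and the tie‑breaking rule are assumptions, not details taken from the paper.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def merge_labels_by_vote(observations: Iterable[Tuple[int, int]]) -> Dict[int, int]:
    """Majority-vote merge of (point_id, label) pairs collected from several 2D views."""
    votes = defaultdict(Counter)
    for point_id, label in observations:
        votes[point_id][label] += 1
    # most_common(1) returns the label with the highest count; on ties, the
    # label that was encountered first wins.
    return {pid: counts.most_common(1)[0][0] for pid, counts in votes.items()}


# Hypothetical back-projected labels: point 0 is seen in three views, etc.
observations = [(0, 2), (0, 2), (0, 5), (1, 7), (2, 3), (2, 4), (2, 4)]
merged = merge_labels_by_vote(observations)
```

A weighted variant (for example, weighting each vote by the 2D model's confidence) would fit the same structure by adding a score instead of 1 per observation.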

Abstract:
Identifying and counting blood components such as red blood cells, various types of white blood cells, and platelets is a critical task for healthcare practitioners. Deep learning approaches, particularly convolutional neural networks (CNNs) using supervised learning strategies, have shown considerable success for such tasks. However, CNN‑based architectures such as U‑Net often struggle to accurately identify platelets due to their size and the high variability of their features. To address these challenges, researchers have commonly employed strategies such as class‑weighted loss functions, which have demonstrated some success. However, this does not address the more significant challenge of platelet variability in size and the tendency of platelets to form aggregates and associations with other blood components. In this study, we explored an alternative approach by investigating the role of convolutional kernels in mitigating these issues. We also assigned separate classes to singular platelets and platelet aggregates and performed semantic segmentation using various U‑Net architectures for identifying platelets. We then evaluated and compared two common methods (pixel area method and connected component analysis) for counting platelets and proposed an alternative approach specialized for single platelets and platelet aggregates. Our experiments showed significant improvements in the identification of platelets, highlighting the importance of optimizing convolutional operations and class designations. We show that the common practice of pixel area‑based counting often overestimates platelet counts, whereas the proposed method offers significant improvements. We discuss in detail how these methods derive counts from segmentation masks.
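To make the two common counting methods concrete, the sketch below counts objects in a binary mask both ways. It assumes the pixel area method divides the total foreground area by a reference area for one platelet, and it uses 4-connectivity for the components; the abstract fixes neither detail, and the function names are hypothetical. The authors' proposed alternative is not reproduced here.

```python
import numpy as np
from collections import deque

def count_by_pixel_area(mask, mean_platelet_area):
    """Estimate the count as total foreground area / area of one typical platelet."""
    return int(round(mask.sum() / mean_platelet_area))

def count_connected_components(mask):
    """Count 4-connected foreground blobs with a breadth-first flood fill."""
    mask = mask.astype(bool)
    seen = np.zeros(mask.shape, dtype=bool)
    height, width = mask.shape
    count = 0
    for r in range(height):
        for c in range(width):
            if not mask[r, c] or seen[r, c]:
                continue
            count += 1
            seen[r, c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < height and 0 <= nx < width and mask[ny, nx] and not seen[ny, nx]:
                        seen[ny, nx] = True
                        queue.append((ny, nx))
    return count

mask = np.zeros((6, 8), dtype=np.uint8)
mask[0:2, 0:2] = 1  # one 4-pixel blob
mask[3:6, 3:7] = 1  # one 12-pixel blob (a large or aggregated region)
area_count = count_by_pixel_area(mask, mean_platelet_area=4)
cc_count = count_connected_components(mask)
```

On this toy mask the area-based estimate is 4 while the component count is 2, which shows how the two methods diverge as soon as object sizes vary.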

Abstract:
Accurate dietary monitoring is essential for promoting healthier eating habits. A key area of research is how people interact with and consume food using utensils and hands. By tracking the position and orientation of utensils and hands, it is possible to estimate the volume of food being consumed or to monitor eating behaviours, yielding highly useful insights into nutritional intake that can be more reliable than popular methods such as self‑reporting. Hence, this paper implements a system that analyzes a stationary video feed of people eating, using 6D pose estimation to track hand and spoon movements to capture spatial position and orientation. In doing so, we examine the performance of two state‑of‑the‑art (SOTA) video object segmentation (VOS) models, both quantitatively and qualitatively, and identify the main sources of error within the system.

Abstract:
The recent Segment Anything Model (SAM) demonstrates strong instance segmentation performance across various downstream tasks. However, SAM is trained solely on RGB data, limiting its direct applicability to RGB‑thermal (RGB‑T) semantic segmentation. Given that RGB‑T provides a robust solution for scene understanding in adverse weather and lighting conditions, such as low light and overexposure, we propose a novel framework, SARTM, which customizes the powerful SAM for RGB‑T semantic segmentation. Our key idea is to unleash the potential of SAM while introducing semantic understanding modules for RGB‑T data pairs. Specifically, our framework first fine‑tunes the original SAM by adding extra LoRA layers, aiming to preserve SAM's strong generalization and segmentation capabilities for downstream tasks. Second, we introduce language information as guidance for training SARTM. To address cross‑modal inconsistencies, we introduce a Cross‑Modal Knowledge Distillation (CMKD) module that effectively achieves modality adaptation while maintaining its generalization capabilities. This semantic module enables the minimization of modality gaps and alleviates semantic ambiguity, facilitating the combination of any modality under any visual conditions. Furthermore, we enhance the segmentation performance by adjusting the segmentation head of SAM and incorporating an auxiliary semantic segmentation head, which integrates multi‑scale features for effective fusion. Extensive experiments are conducted across three multi‑modal RGB‑T semantic segmentation benchmarks: MFNET, PST900, and FMB. Both quantitative and qualitative results consistently demonstrate that the proposed SARTM significantly outperforms state‑of‑the‑art approaches across a variety of conditions.

Abstract:
Manual annotation of volumetric medical images, such as magnetic resonance imaging (MRI) and computed tomography (CT), is a labor‑intensive and time‑consuming process. Recent advancements in foundation models for video object segmentation, such as Segment Anything Model 2 (SAM 2), offer a potential opportunity to significantly speed up the annotation process by manually annotating one or a few slices and then propagating target masks across the entire volume. However, the performance of SAM 2 in this context varies. Our experiments show that relying on a single memory bank and attention module is prone to error propagation, particularly at boundary regions where the target is present in the previous slice but absent in the current one. To address this problem, we propose Short‑Long Memory SAM 2 (SLM‑SAM 2), a novel architecture that integrates distinct short‑term and long‑term memory banks with separate attention modules to improve segmentation accuracy. We evaluate SLM‑SAM 2 on four public datasets covering organs, bones, and muscles across MRI, CT, and ultrasound videos. We show that the proposed method markedly outperforms the default SAM 2, achieving an average Dice Similarity Coefficient improvement of 0.14 and 0.10 in the scenarios when 5 volumes and 1 volume are available for the initial adaptation, respectively. SLM‑SAM 2 also exhibits stronger resistance to over‑propagation, reducing the time required to correct propagated masks by 60.575% per volume compared to SAM 2, making a notable step toward more accurate automated annotation of medical images for segmentation model development.
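The improvements above are reported as Dice Similarity Coefficient (DSC) differences. For reference, a minimal sketch of the standard DSC on binary masks follows; the function name and the small `eps` stabilizer are choices of this sketch, not details from the paper.

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-8):
    """Dice = 2 |A and B| / (|A| + |B|) for two binary masks of equal shape."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    # eps keeps the ratio defined (and equal to 1) when both masks are empty.
    return float((2.0 * intersection + eps) / (pred.sum() + target.sum() + eps))

pred = np.array([[1, 1, 0], [0, 0, 0]])
target = np.array([[1, 0, 0], [0, 0, 0]])
dice = dice_coefficient(pred, target)  # one shared pixel, 2 + 1 foreground pixels
```

Here the masks share one pixel out of three foreground pixels in total, so the score is 2/3. On this scale, an improvement of 0.14 means the score rises by 0.14 on a 0 to 1 range.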

Abstract:
The pattern analysis of tree structure holds significant scientific value for genetic breeding and forestry management. Current trunk and branch extraction technologies are mainly LiDAR‑based or UAV‑based. The former approaches obtain high‑precision three‑dimensional (3D) data, but their equipment cost is high and the 3D data processing is complex. The latter approaches efficiently capture canopy information, but they miss the 3D structure of trees. To extract branch information despite complex background interference and occlusion, this work proposes WaveInst, a novel instance segmentation framework that uses a discrete wavelet transform to enhance multi‑scale edge information and thereby improve the accuracy of tree structure extraction. Experimental results show that the proposed model achieves superior performance on SynthTree43k, CaneTree100, Urban Street, and our PoplarDataset. Moreover, we present a new phenotypic dataset, PoplarDataset, which is dedicated to tree structure extraction and pattern analysis in artificial forests. The proposed method achieves a mean average precision of 49.6 and 24.3 for the structure extraction of mature and juvenile trees, respectively, surpassing the existing state‑of‑the‑art method by 9.9. Furthermore, by integrating the segmentation model within the regression model, we accurately obtain significant tree growth parameters, such as the location of trees, the diameter‑at‑breast‑height of individual trees, and the plant height, directly from 2D images. This study provides a scientific basis and ample data for tree structure analysis in phenotype research, offering a platform for significant applications in precision forestry, ecological monitoring, and intelligent breeding.
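To illustrate why a discrete wavelet transform exposes edge information, the sketch below applies one level of a 2D Haar transform. The Haar wavelet is an assumption made for simplicity; the abstract does not say which wavelet family WaveInst uses or how the sub-bands enter the network, and the function and band names are hypothetical.

```python
import numpy as np

def haar_dwt2(image):
    """One-level 2D Haar transform of an array with even height and width.

    Returns (approx, horiz_edges, vert_edges, diag): the low-pass band plus
    three detail bands that respond to intensity changes across rows,
    across columns, and along diagonals.
    """
    x = image.astype(np.float64)
    s = np.sqrt(2.0)
    # Pair up adjacent columns: sums are low-pass, differences are high-pass.
    lo = (x[:, 0::2] + x[:, 1::2]) / s
    hi = (x[:, 0::2] - x[:, 1::2]) / s
    # Pair up adjacent rows of each intermediate band.
    approx = (lo[0::2, :] + lo[1::2, :]) / s
    horiz_edges = (lo[0::2, :] - lo[1::2, :]) / s
    vert_edges = (hi[0::2, :] + hi[1::2, :]) / s
    diag = (hi[0::2, :] - hi[1::2, :]) / s
    return approx, horiz_edges, vert_edges, diag

# A vertical step edge between column 0 and column 1.
image = np.zeros((4, 4))
image[:, 1:] = 1.0
approx, horiz_edges, vert_edges, diag = haar_dwt2(image)
```

Only the `vert_edges` band is non-zero for this input, and the transform preserves the total energy of the image. Repeating the transform on `approx` gives the coarser scales of a multi-scale decomposition.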

Abstract:
Understanding causal event relationships and achieving fine‑grained temporal grounding in videos remain challenging for vision‑language models. Existing methods either compress video tokens to reduce temporal resolution, or treat videos as unsegmented streams, which obscures fine‑grained event boundaries and limits the modeling of causal dependencies. We propose TEMPURA (Temporal Event Masked Prediction and Understanding for Reasoning in Action), a two‑stage training framework that enhances video temporal understanding. TEMPURA first applies masked event prediction reasoning to reconstruct missing events and generate step‑by‑step causal explanations from dense event annotations, drawing inspiration from effective infilling techniques. TEMPURA then learns to perform video segmentation and dense captioning to decompose videos into non‑overlapping events with detailed, timestamp‑aligned descriptions. We train TEMPURA on VER, a large‑scale dataset curated by us that comprises 1M training instances and 500K videos with temporally aligned event descriptions and structured reasoning steps. Experiments on temporal grounding and highlight detection benchmarks demonstrate that TEMPURA outperforms strong baseline models, confirming that integrating causal reasoning with fine‑grained temporal segmentation leads to improved video understanding.

Abstract:
Remote sensing enables a wide range of critical applications such as land cover and land use mapping, crop yield prediction, and environmental monitoring. Advances in satellite technology have expanded remote sensing datasets, yet high‑performance segmentation models remain dependent on extensive labeled data, challenged by annotation scarcity and variability across sensors, illumination, and geography. Domain adaptation offers a promising solution to improve model generalization. This paper introduces a domain generalization approach to leveraging emerging geospatial foundation models by combining soft‑alignment pseudo‑labeling with source‑to‑target generative pre‑training. We further provide new mathematical insights into MAE‑based generative learning for domain‑invariant feature learning. Experiments with hyperspectral and multispectral remote sensing datasets confirm our method's effectiveness in enhancing adaptability and segmentation.

Abstract:
Audio‑visual segmentation aims to separate sounding objects from videos by predicting pixel‑level masks based on audio signals. Existing methods primarily concentrate on closed‑set scenarios and direct audio‑visual alignment and fusion, which limits their capability to generalize to new, unseen situations. In this paper, we propose OpenAVS, a novel training‑free language‑based approach that, for the first time, effectively aligns audio and visual modalities using text as a proxy for open‑vocabulary Audio‑Visual Segmentation (AVS). Equipped with multimedia foundation models, OpenAVS directly infers masks through 1) audio‑to‑text prompt generation, 2) LLM‑guided prompt translation, and 3) text‑to‑visual sounding object segmentation. The objective of OpenAVS is to establish a simple yet flexible architecture that relies on the most appropriate foundation models by fully leveraging their capabilities to enable more effective knowledge transfer to the downstream AVS task. Moreover, we present a model‑agnostic framework OpenAVS‑ST that enables the integration of OpenAVS with any advanced supervised AVS model via pseudo‑label based self‑training. This approach enhances performance by effectively utilizing large‑scale unlabeled data when available. Comprehensive experiments on three benchmark datasets demonstrate the superior performance of OpenAVS. It surpasses existing unsupervised, zero‑shot, and few‑shot AVS methods by a significant margin, achieving absolute performance gains of approximately 9.4% and 10.9% in mIoU and F‑score, respectively, in challenging scenarios.

Abstract:
This paper introduces GeloVec, a new CNN‑based attention smoothing framework for semantic segmentation that addresses critical limitations in conventional approaches. While existing attention‑backed segmentation methods suffer from boundary instability and contextual discontinuities during feature mapping, our framework implements a higher‑dimensional geometric smoothing method to establish robust manifold relationships between visually coherent regions. GeloVec combines modified Chebyshev distance metrics with multispatial transformations to enhance segmentation accuracy through stabilized feature extraction. The core innovation lies in the adaptive sampling weights system that calculates geometric distances in n‑dimensional feature space, achieving superior edge preservation while maintaining intra‑class homogeneity. The multispatial transformation matrix incorporates tensorial projections with orthogonal basis vectors, creating more discriminative feature representations without sacrificing computational efficiency. Experimental validation across multiple benchmark datasets demonstrates significant improvements in segmentation performance, with mean Intersection over Union (mIoU) gains of 2.1%, 2.7%, and 2.4% on the Caltech Birds‑200, LSDSC, and FSSD datasets, respectively, compared to state‑of‑the‑art methods. GeloVec's mathematical foundation in Riemannian geometry provides theoretical guarantees on segmentation stability. Importantly, our framework maintains computational efficiency through parallelized implementation of geodesic transformations and exhibits strong generalization capabilities across disciplines due to the absence of information loss during transformations.
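For readers unfamiliar with the metric named above, the sketch below computes the standard (unmodified) Chebyshev distance between feature vectors in n-dimensional space. The abstract does not describe GeloVec's modification or how distances become sampling weights, so none of that is reproduced; the function names are hypothetical.

```python
import numpy as np

def chebyshev_distance(a, b):
    """Standard Chebyshev (L-infinity) distance: the largest per-dimension gap."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.max(np.abs(a - b)))

def pairwise_chebyshev(features):
    """All-pairs Chebyshev distances for an (N, D) array of feature vectors."""
    features = np.asarray(features, dtype=np.float64)
    diffs = np.abs(features[:, None, :] - features[None, :, :])  # shape (N, N, D)
    return diffs.max(axis=-1)

d = chebyshev_distance([1, 5, 2], [4, 1, 2])  # per-dimension gaps are 3, 4, 0
dist_matrix = pairwise_chebyshev([[1, 5, 2], [4, 1, 2], [1, 5, 2]])
```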

Abstract:
The recent Segment Anything Model 2 (SAM2) has demonstrated exceptional capabilities in interactive object segmentation for both images and videos. However, as a foundational model for interactive segmentation, SAM2 performs segmentation directly based on mask memory from the past six frames, leading to two significant challenges. Firstly, during inference in videos, objects may disappear since SAM2 relies solely on memory without accounting for object motion information, which limits its long‑range object tracking capabilities. Secondly, its memory is constructed from fixed past frames, making it susceptible to challenges associated with object disappearance or occlusion, due to potentially inaccurate segmentation results in memory. To address these problems, we present MoSAM, incorporating two key strategies to integrate object motion cues into the model and establish more reliable feature memory. Firstly, we propose Motion‑Guided Prompting (MGP), which represents the object motion in both sparse and dense manners, then injects them into SAM2 through a set of motion‑guided prompts. MGP enables the model to adjust its focus towards the direction of motion, thereby enhancing the object tracking capabilities. Furthermore, acknowledging that past segmentation results may be inaccurate, we devise a Spatial‑Temporal Memory Selection (ST‑MS) mechanism that dynamically identifies frames likely to contain accurate segmentation at both the pixel and frame levels. By eliminating potentially inaccurate mask predictions from memory, we can leverage more reliable memory features to exploit similar regions for improving segmentation results. Extensive experiments on various benchmarks of video object segmentation and video instance segmentation demonstrate that our MoSAM achieves state‑of‑the‑art results compared to other competitors.

Abstract:
3D morphable models (3DMMs) are a powerful tool to represent the possible shapes and appearances of an object category. Given a single test image, 3DMMs can be used to solve various tasks, such as predicting the 3D shape, pose, semantic correspondence, and instance segmentation of an object. Unfortunately, 3DMMs are only available for very few object categories that are of particular interest, like faces or human bodies, as they require a demanding 3D data acquisition and category‑specific training process. In contrast, we introduce a new method, Common3D, that learns 3DMMs of common objects in a fully self‑supervised manner from a collection of object‑centric videos. For this purpose, our model represents objects as a learned 3D template mesh and a deformation field that is parameterized as an image‑conditioned neural network. Different from prior works, Common3D represents the object appearance with neural features instead of RGB colors, which enables the learning of more generalizable representations through an abstraction from pixel intensities. Importantly, we train the appearance features using a contrastive objective by exploiting the correspondences defined through the deformable template mesh. This leads to higher quality correspondence features compared to related works and a significantly improved model performance at estimating 3D object pose and semantic correspondence. Common3D is the first completely self‑supervised method that can solve various vision tasks in a zero‑shot manner.

Abstract:
Successful video analysis relies on accurate recognition of pixels across frames, and frame reconstruction methods based on video correspondence learning are popular due to their efficiency. Existing frame reconstruction methods, while efficient, neglect the value of direct involvement of multiple reference frames for reconstruction and decision‑making aspects, especially in complex situations such as occlusion or fast movement. In this paper, we introduce a Dynamic Memory Prediction (DMP) framework that innovatively utilizes multiple reference frames to concisely and directly enhance frame reconstruction. Its core component is a Reference Frame Memory Engine that dynamically selects frames based on object pixel features to improve tracking accuracy. In addition, a Bidirectional Target Prediction Network is built to utilize multiple reference frames to improve the robustness of the model. Through experiments, our algorithm outperforms the state‑of‑the‑art self‑supervised techniques on two fine‑grained video object tracking tasks: object segmentation and keypoint tracking.
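The general idea of reconstructing a frame from several reference frames can be sketched as affinity-weighted label propagation, a common formulation in correspondence learning. This is not the DMP architecture: the abstract does not specify its memory engine or prediction network, so the softmax affinity, the `temperature` value, and all names below are assumptions.

```python
import numpy as np

def propagate_labels(query_feats, ref_feats_list, ref_labels_list, temperature=0.07):
    """Reconstruct per-pixel labels of a query frame from several reference frames.

    query_feats: (Nq, D) pixel features of the query frame.
    ref_feats_list: list of (Nr_i, D) feature arrays, one per reference frame.
    ref_labels_list: list of (Nr_i, C) one-hot (or soft) label arrays.
    """
    # Pool all reference frames into one memory so every query pixel can attend to all of them.
    ref_feats = np.concatenate(ref_feats_list, axis=0)
    ref_labels = np.concatenate(ref_labels_list, axis=0)
    logits = query_feats @ ref_feats.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ ref_labels  # (Nq, C) soft labels

# Two reference frames with unit-norm 2D features and two classes.
ref1_feats = np.array([[1.0, 0.0], [0.0, 1.0]])
ref1_labels = np.array([[1.0, 0.0], [0.0, 1.0]])
ref2_feats = np.array([[1.0, 0.0]])
ref2_labels = np.array([[1.0, 0.0]])
query_feats = np.array([[1.0, 0.0], [0.0, 1.0]])
soft = propagate_labels(query_feats, [ref1_feats, ref2_feats], [ref1_labels, ref2_labels])
predicted = soft.argmax(axis=1)
```

A frame selection step, as described in the abstract, would decide which frames enter `ref_feats_list` before this propagation runs.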

Abstract:
The high degrees of freedom and complex structure of garments present significant challenges for clothing manipulation. In this paper, we propose a general topological dynamics model to fold complex clothing. By utilizing the visible folding structure as the topological skeleton, we design a novel topological graph to represent the clothing state. This topological graph is low‑dimensional and applicable to complex clothing in various folding states. It indicates the constraints of clothing and enables predictions regarding clothing movement. To extract graphs from self‑occlusion, we apply semantic segmentation to analyze the occlusion relationships and decompose the clothing structure. The decomposed structure is then combined with keypoint detection to generate the topological graph. To analyze the behavior of the topological graph, we employ an improved Graph Neural Network (GNN) to learn the general dynamics. The GNN model can predict the deformation of clothing and is employed to calculate the deformation Jacobian matrix for control. Experiments using jackets validate the algorithm's effectiveness in recognizing and folding complex clothing with self‑occlusion.

Abstract:
Precise segmentation of out‑of‑distribution (OoD) objects, herein referred to as anomalies, is crucial for the reliable deployment of semantic segmentation models in open‑set, safety‑critical applications, such as autonomous driving. Current anomalous segmentation benchmarks predominantly focus on favorable weather conditions, resulting in untrustworthy evaluations that overlook the risks posed by diverse meteorological conditions in open‑set environments, such as low illumination, dense fog, and heavy rain. To bridge this gap, this paper introduces ComsAmy, a challenging benchmark specifically designed for open‑set anomaly segmentation in complex scenarios. ComsAmy encompasses a wide spectrum of adverse weather conditions, dynamic driving environments, and diverse anomaly types to comprehensively evaluate model performance in realistic open‑world scenarios. Our extensive evaluation of several state‑of‑the‑art anomalous segmentation models reveals that existing methods demonstrate significant deficiencies in such challenging scenarios, highlighting their serious safety risks for real‑world deployment. To address this, we propose a novel energy‑entropy learning (EEL) strategy that integrates the complementary information from energy and entropy to bolster the robustness of anomaly segmentation under complex open‑world environments. Additionally, a diffusion‑based anomalous training data synthesizer is proposed to generate diverse and high‑quality anomalous images to enhance the existing copy‑paste training data synthesizer. Extensive experimental results on both public and ComsAmy benchmarks demonstrate that our proposed diffusion‑based synthesizer with energy and entropy learning (DiffEEL) serves as an effective and generalizable plug‑and‑play method to enhance existing models, yielding an average improvement of around 4.96% in AUPRC and 9.87% in FPR95.
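The two signals that EEL combines can each be computed per pixel from segmentation logits. The sketch below uses the usual definitions (free energy as the negative log-sum-exp of the logits, and Shannon entropy of the softmax); how the paper integrates them during training is not stated in the abstract and is not reproduced. The function name is hypothetical.

```python
import numpy as np

def energy_and_entropy(logits):
    """Per-pixel free energy and softmax entropy from logits of shape (C, H, W).

    Higher energy (closer to zero) and higher entropy both suggest that
    no known class explains the pixel well.
    """
    peak = logits.max(axis=0, keepdims=True)
    log_sum_exp = peak[0] + np.log(np.exp(logits - peak).sum(axis=0))  # stable logsumexp
    energy = -log_sum_exp
    log_probs = logits - log_sum_exp  # broadcasts (C, H, W) against (H, W)
    probs = np.exp(log_probs)
    entropy = -(probs * log_probs).sum(axis=0)
    return energy, entropy

# Three classes, a 1x2 image: pixel 0 is confidently class 0, pixel 1 is undecided.
logits = np.array([[[10.0, 0.0]], [[0.0, 0.0]], [[0.0, 0.0]]])
energy, entropy = energy_and_entropy(logits)
```

For the undecided pixel the entropy equals log 3 and the energy equals minus log 3, while the confident pixel scores much lower on both.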

Abstract:
Automatic extraction of chemical structures from scientific literature plays a crucial role in accelerating research across fields ranging from drug discovery to materials science. Patent documents, in particular, contain molecular information in visual form, which is often inaccessible through traditional text‑based searches. In this work, we introduce SubGrapher, a method for the visual fingerprinting of chemical structure images. Unlike conventional Optical Chemical Structure Recognition (OCSR) models that attempt to reconstruct full molecular graphs, SubGrapher focuses on extracting molecular fingerprints directly from chemical structure images. Using learning‑based instance segmentation, SubGrapher identifies functional groups and carbon backbones, constructing a substructure‑based fingerprint that enables chemical structure retrieval. Our approach is evaluated against state‑of‑the‑art OCSR and fingerprinting methods, demonstrating superior retrieval performance and robustness across diverse molecular depictions. The dataset, models, and code are publicly available.

Abstract:
Underwater instance segmentation is challenging due to adverse visual conditions such as light attenuation, scattering, and color distortion, which degrade model performance. In this work, we propose BARIS‑Decoder (Boundary‑Aware Refinement Decoder for Instance Segmentation), a framework that enhances segmentation accuracy through feature refinement. To address underwater degradations, we introduce the Environmental Robust Adapter (ERA), which efficiently models underwater degradation patterns while reducing trainable parameters by over 90% compared to full fine‑tuning. The integration of BARIS‑Decoder with ERA‑tuning, referred to as BARIS‑ERA, achieves state‑of‑the‑art performance, surpassing Mask R‑CNN by 3.4 mAP with a Swin‑B backbone and 3.8 mAP with ConvNeXt V2. Our findings demonstrate the effectiveness of BARIS‑ERA in advancing underwater instance segmentation, providing a robust and efficient solution.

Abstract:
Open‑vocabulary 3D scene understanding is pivotal for enhancing physical intelligence, as it enables embodied agents to interpret and interact dynamically within real‑world environments. This paper introduces MPEC, a novel Masked Point‑Entity Contrastive learning method for open‑vocabulary 3D semantic segmentation that leverages both 3D entity‑language alignment and point‑entity consistency across different point cloud views to foster entity‑specific feature representations. Our method improves semantic discrimination and enhances the differentiation of unique instances, achieving state‑of‑the‑art results on ScanNet for open‑vocabulary 3D semantic segmentation and demonstrating superior zero‑shot scene understanding capabilities. Extensive fine‑tuning experiments on 8 datasets, spanning from low‑level perception to high‑level reasoning tasks, showcase the potential of learned 3D features, driving consistent performance gains across varied 3D scene understanding tasks. Project website: https://mpec‑3d.github.io/

Abstract:
Semantic‑aware 3D scene reconstruction is essential for autonomous robots to perform complex interactions. Semantic SLAM, an online approach that integrates pose tracking, geometric reconstruction, and semantic mapping into a unified framework, shows significant potential. However, existing systems, which rely on 2D ground truth priors for supervision, are often limited by the sparsity and noise of these signals in real‑world environments. To address this challenge, we propose GSFF‑SLAM, a novel dense semantic SLAM system based on 3D Gaussian Splatting that leverages feature fields to achieve joint rendering of appearance, geometry, and N‑dimensional semantic features. By independently optimizing feature gradients, our method supports semantic reconstruction using various forms of 2D priors, particularly sparse and noisy signals. Experimental results demonstrate that our approach outperforms previous methods in both tracking accuracy and photorealistic rendering quality. When utilizing 2D ground truth priors, GSFF‑SLAM achieves state‑of‑the‑art semantic segmentation performance with 95.03% mIoU, while achieving up to 2.9× speedup with only marginal performance degradation.

Abstract:
Real‑time open‑vocabulary scene understanding is essential for efficient 3D perception in applications such as vision‑language navigation, embodied intelligence, and augmented reality. However, existing methods suffer from imprecise instance segmentation, static semantic updates, and limited handling of complex queries. To address these issues, we present OpenFusion++, a TSDF‑based real‑time 3D semantic‑geometric reconstruction system. Our approach refines 3D point clouds by fusing confidence maps from foundational models, dynamically updates global semantic labels via an adaptive cache based on instance area, and employs a dual‑path encoding framework that integrates object attributes with environmental context for precise query responses. Experiments on the ICL, Replica, ScanNet, and ScanNet++ datasets demonstrate that OpenFusion++ significantly outperforms the baseline in both semantic accuracy and query responsiveness.

Abstract:
We introduce VISUALCENT, a unified human pose and instance segmentation framework that addresses generalizability and scalability limitations in multi‑person visual human analysis. VISUALCENT leverages a centroid‑based, bottom‑up keypoint detection paradigm and uses a Keypoint Heatmap incorporating Disk Representation and KeyCentroid to identify the optimal keypoint coordinates. For the unified segmentation task, an explicit keypoint is defined as a dynamic centroid called MaskCentroid to swiftly cluster pixels to a specific human instance during rapid changes in human body movement or in significantly occluded environments. Experimental results on the COCO and OCHuman datasets demonstrate VISUALCENT's accuracy and real‑time performance advantages, outperforming existing methods in mAP scores and frames per second. The implementation is available on the project page.

Abstract:
Autonomous Vehicles (AVs) require precise lane and object detection to ensure safe navigation. However, centralized deep learning (DL) approaches for semantic segmentation raise privacy and scalability challenges, particularly when handling sensitive data. This research presents a new federated learning (FL) framework that integrates secure deep Convolutional Neural Networks (CNNs) and Differential Privacy (DP) to address these issues. The core contributions of this work are: (1) developing a new hybrid UNet‑ResNet34 architecture for centralized semantic segmentation to achieve high accuracy and tackle privacy concerns due to centralized training, and (2) implementing the privacy‑preserving FL model, distributed across AVs to enhance performance through secure CNNs and DP mechanisms. In the proposed FL framework, the methodology distinguishes itself from existing approaches through the following: (a) ensuring data decentralization through FL to uphold user privacy by eliminating the need for centralized data aggregation, (b) integrating DP mechanisms to secure sensitive model updates against potential adversarial inference attacks, and (c) evaluating the framework's performance and generalizability using RGB and semantic segmentation datasets derived from the CARLA simulator. Experimental results show significant improvements in accuracy, from 81.5% to 88.7% for the RGB dataset and from 79.3% to 86.9% for the SEG dataset over 20 to 70 Communication Rounds (CRs). Global loss was reduced by over 60%, and minor accuracy trade‑offs from DP were observed. This study contributes by offering a scalable, privacy‑preserving FL framework tailored for AVs, optimizing communication efficiency while balancing performance and data security.
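One common way to combine FL with DP is to clip each client's model update, average the clipped updates, and add Gaussian noise to the average. The sketch below shows one such aggregation round. It is an assumed recipe for illustration only: the abstract does not state the aggregation rule or the DP mechanism the framework uses, and every name and parameter here is hypothetical.

```python
import numpy as np

def clip_update(update, clip_norm):
    """Scale an update down so that its L2 norm is at most clip_norm."""
    norm = np.linalg.norm(update)
    return update * min(1.0, clip_norm / (norm + 1e-12))

def dp_federated_average(client_updates, clip_norm, noise_multiplier, rng):
    """Average clipped client updates and add Gaussian noise to the result."""
    clipped = [clip_update(u, clip_norm) for u in client_updates]
    mean_update = np.mean(clipped, axis=0)
    # Noise scale for the mean: each client can shift the sum by at most clip_norm.
    sigma = noise_multiplier * clip_norm / len(client_updates)
    return mean_update + rng.normal(0.0, sigma, size=mean_update.shape)

rng = np.random.default_rng(0)
client_updates = [np.array([3.0, 4.0]), np.array([0.6, 0.8])]  # L2 norms 5 and 1
noiseless = dp_federated_average(client_updates, clip_norm=1.0, noise_multiplier=0.0, rng=rng)
noisy = dp_federated_average(client_updates, clip_norm=1.0, noise_multiplier=1.0, rng=rng)
```

A larger `noise_multiplier` gives stronger privacy at the cost of a noisier global update, which matches the kind of accuracy trade-off the abstract mentions.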

Abstract:
LiDAR‑based semantic segmentation is critical for autonomous trains, requiring accurate predictions across varying distances. This paper introduces two targeted data augmentation methods designed to improve segmentation performance on the railway‑specific OSDaR23 dataset. The person instance pasting method enhances segmentation of pedestrians at distant ranges by injecting realistic variations into the dataset. The track sparsification method redistributes point density in LiDAR scans, improving track segmentation at far distances with minimal impact on close‑range accuracy. Both methods are evaluated using a state‑of‑the‑art 3D semantic segmentation network, demonstrating significant improvements in distant‑range performance while maintaining robustness in close‑range predictions. We establish the first 3D semantic segmentation benchmark for OSDaR23, demonstrating the potential of data‑centric approaches to address railway‑specific challenges in autonomous train perception.

Abstract:
In an era where social media platforms abound, individuals frequently share images that offer insights into their intents and interests, impacting individual life quality and societal stability. Traditional computer vision tasks, such as object detection and semantic segmentation, focus on concrete visual representations, while intent recognition relies more on implicit visual clues. This poses challenges due to the wide variation and subjectivity of such clues, compounded by the problem of intra‑class variety in conveying abstract concepts, e.g. "enjoy life". Existing methods seek to solve the problem by manually designing representative features or building prototypes for each class from global features. However, these methods still struggle to deal with the large visual diversity of each intent category. In this paper, we introduce a novel approach named Multi‑grained Compositional visual Clue Learning (MCCL) to address these challenges for image intent recognition. Our method leverages the systematic compositionality of human cognition by breaking down intent recognition into visual clue composition and integrating multi‑grained features. We adopt class‑specific prototypes to alleviate data imbalance. We treat intent recognition as a multi‑label classification problem, using a graph convolutional network to infuse prior knowledge through label embedding correlations. Demonstrated by a state‑of‑the‑art performance on the Intentonomy and MDID datasets, our approach advances the accuracy of existing methods while also possessing good interpretability. Our work provides an attempt for future explorations in understanding complex and miscellaneous forms of human expression.

Abstract:
Unsupervised Domain Adaptation (UDA) can improve a perception model's generalization to an unlabeled target domain starting from a labeled source domain. UDA using Vision Foundation Models (VFMs) with synthetic source data can achieve generalization performance comparable to fully‑supervised learning with real target data. However, because VFMs have strong generalization from their pre‑training, more straightforward, source‑only fine‑tuning can also perform well on the target. As data scenarios used in academic research are not necessarily representative for real‑world applications, it is currently unclear (a) how UDA behaves with more representative and diverse data and (b) if source‑only fine‑tuning of VFMs can perform equally well in these scenarios. Our research aims to close these gaps and, similar to previous studies, we focus on semantic segmentation as a representative perception task. We assess UDA for synth‑to‑real and real‑to‑real use cases with different source and target data combinations. We also investigate the effect of using a small amount of labeled target data in UDA. We clarify that while these scenarios are more realistic, they are not necessarily more challenging. Our results show that, when using stronger synthetic source data, UDA's improvement over source‑only fine‑tuning of VFMs reduces from +8 mIoU to +2 mIoU, and when using more diverse real source data, UDA has no added value. However, UDA generalization is always higher in all synthetic data scenarios than source‑only fine‑tuning and, when including only 1/16 of Cityscapes labels, synthetic UDA obtains the same state‑of‑the‑art segmentation quality of 85 mIoU as a fully‑supervised model using all labels. Considering the mixed results, we discuss how UDA can best support robust autonomous driving at scale.

Abstract:
Vision Transformers (ViTs) excel in semantic segmentation but demand significant computation, posing challenges for deployment on resource‑constrained devices. Existing token pruning methods often overlook fundamental visual data characteristics. This study introduces 'LVTP', a progressive token pruning framework guided by multi‑scale Tsallis entropy and low‑level visual features with twice clustering. It integrates high‑level semantics and basic visual attributes for precise segmentation. A novel dynamic scoring mechanism using multi‑scale Tsallis entropy weighting overcomes limitations of traditional single‑parameter entropy. The framework also incorporates low‑level feature analysis to preserve critical edge information while optimizing computational cost. As a plug‑and‑play module, it requires no architectural changes or additional training. Evaluations across multiple datasets show 20%‑45% computational reductions with negligible performance loss, outperforming existing methods in balancing cost and accuracy, especially in complex edge regions.
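
For readers unfamiliar with Tsallis entropy, the sketch below computes S_q = (1 - sum_i p_i^q) / (q - 1), which reduces to Shannon entropy as q approaches 1. Combining several q values into one token score is only one possible reading of "multi-scale" weighting; the function `multi_q_score`, its default q values, and its weights are assumptions for illustration, not the LVTP scoring rule.

```python
import math


def tsallis_entropy(probs, q):
    """Tsallis entropy of a discrete distribution; Shannon entropy when q == 1."""
    if abs(q - 1.0) < 1e-12:
        return -sum(p * math.log(p) for p in probs if p > 0)
    return (1.0 - sum(p ** q for p in probs)) / (q - 1.0)


def multi_q_score(probs, qs=(0.5, 2.0), weights=(0.5, 0.5)):
    """Hypothetical token score: weighted sum of Tsallis entropies at several q."""
    return sum(w * tsallis_entropy(probs, q) for q, w in zip(qs, weights))


# Toy per-token class distributions (illustrative values).
uniform = [0.25, 0.25, 0.25, 0.25]   # maximally uncertain token
peaked = [0.97, 0.01, 0.01, 0.01]    # confident token
scores = {"uniform": multi_q_score(uniform), "peaked": multi_q_score(peaked)}
```

Whether high-scoring or low-scoring tokens are pruned is a design decision the abstract does not state, so the sketch stops at the score.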

Abstract:
This paper presents a digital‑twin platform for active safety analysis in mixed traffic environments. The platform is built using a multi‑modal data‑enabled traffic environment constructed from drone‑based aerial LiDAR, OpenStreetMap, and vehicle sensor data (e.g., GPS and inclinometer readings). High‑resolution 3D road geometries are generated through AI‑powered semantic segmentation and georeferencing of aerial LiDAR data. To simulate real‑world driving scenarios, the platform integrates the CAR Learning to Act (CARLA) simulator, Simulation of Urban MObility (SUMO) traffic model, and NVIDIA PhysX vehicle dynamics engine. CARLA provides detailed micro‑level sensor and perception data, while SUMO manages macro‑level traffic flow. NVIDIA PhysX enables accurate modeling of vehicle behaviors under diverse conditions, accounting for mass distribution, tire friction, and center of mass. This integrated system supports high‑fidelity simulations that capture the complex interactions between autonomous and conventional vehicles. Experimental results demonstrate the platform's ability to reproduce realistic vehicle dynamics and traffic scenarios, enhancing the analysis of active safety measures. Overall, the proposed framework advances traffic safety research by enabling in‑depth, physics‑informed evaluation of vehicle behavior in dynamic and heterogeneous traffic environments.

Abstract:
Detecting and classifying small blood components is a significant challenge in hematology analytics, in particular when objects exist as small pixel‑sized entities in a large context of similar objects. Deep learning approaches using supervised models with pre‑trained weights, such as residual networks and vision transformers, have demonstrated success for many applications. Unfortunately, when applied to images outside the domain of learned representations, these methods often yield less than acceptable performance. A strategy to overcome this is to use self‑supervised models, where representations are learned and the weights are then applied to downstream applications. Recently, masked autoencoders (MAEs) have proven effective at obtaining representations that capture global context information. By masking regions of an image and having the model learn to reconstruct both the masked and non‑masked regions, weights can be used for various applications. However, if the sizes of the objects in images are less than the size of the mask, the global context information is lost, making it almost impossible to reconstruct the image. In this study, we investigated the effect of mask ratios and patch sizes for blood components using an MAE to obtain learned ViT encoder representations. We then applied the encoder weights to train a U‑Net Transformer for semantic segmentation to obtain both local and global contextual information. Our experimental results demonstrate that both smaller mask ratios and patch sizes improve the reconstruction of images using an MAE. We also show the results of semantic segmentation with and without pre‑trained weights, where smaller‑sized blood components benefited from pre‑training. Overall, our proposed method offers an efficient and effective strategy for the segmentation and classification of small objects.

Abstract:
In this thesis we discuss architectural designs and training methods that enable a neural network to dissect an image into objects of interest without supervision. The main challenge in 2D unsupervised object segmentation is distinguishing between foreground objects of interest and background. FlowCapsules uses motion as a cue for the objects of interest in 2D scenarios. The last part of this thesis focuses on 3D applications where the goal is detecting and removing the object of interest from the input images. In these tasks, we leverage the geometric consistency of scenes in 3D to detect the inconsistent dynamic objects. Our transient object masks are then used for designing robust optimization kernels to improve 3D modelling in a casual capture setup. One of our goals in this thesis is to show the merits of unsupervised object‑based approaches in computer vision. Furthermore, we suggest possible directions for defining objects of interest or foreground objects without requiring supervision. Our hope is to motivate and excite the community into further exploring explicit object representations in image understanding tasks.

Abstract:
We propose a self‑supervised monocular depth estimation network tailored for endoscopic scenes, aiming to infer depth within the gastrointestinal tract from monocular images. Existing methods, though accurate, typically assume consistent illumination, which is often violated due to dynamic lighting and occlusions caused by GI motility. These variations lead to incorrect geometric interpretations and unreliable self‑supervised signals, degrading depth reconstruction quality. To address this, we introduce an occlusion‑aware self‑supervised framework. First, we incorporate an occlusion mask for data augmentation, generating pseudo‑labels by simulating viewpoint‑dependent occlusion scenarios. This enhances the model's ability to learn robust depth features under partial visibility. Second, we leverage semantic segmentation guided by non‑negative matrix factorization, clustering convolutional activations to generate pseudo‑labels in texture‑deprived regions, thereby improving segmentation accuracy and mitigating information loss from lighting changes. Experimental results on the SCARED dataset show that our method achieves state‑of‑the‑art performance in self‑supervised depth estimation. Additionally, evaluations on the Endo‑SLAM and SERV‑CT datasets demonstrate strong generalization across diverse endoscopic environments.

Abstract:
Skin, the primary regulator of heat exchange, relies on sweat glands for thermoregulation. Alterations in sweat gland morphology play a crucial role in various pathological conditions and clinical diagnoses. Current methods for observing sweat gland morphology are limited by their two‑dimensional, in vitro, and destructive nature, underscoring the urgent need for real‑time, non‑invasive, quantifiable technologies. We propose a novel three‑dimensional (3D) transformer‑based multi‑object segmentation framework, integrating a sliding window approach, joint spatial‑channel attention mechanism, and architectural heterogeneity between shallow and deep layers. Our proposed network enables precise 3D sweat gland segmentation from skin volume data captured by optical coherence tomography (OCT). For the first time, subtle variations of sweat gland 3D morphology in response to temperature changes have been visualized and quantified. Our approach establishes a benchmark for normal sweat gland morphology and provides a real‑time, non‑invasive tool for quantifying 3D structural parameters. This enables the study of individual variability and pathological changes in sweat gland structure, advancing dermatological research and clinical applications, including thermoregulation and bromhidrosis treatment.

Abstract:
Generative image models are increasingly being used for training data augmentation in vision tasks. In the context of automotive object detection, methods usually focus on producing augmented frames that look as realistic as possible, for example by replacing real objects with generated ones. Others try to maximize the diversity of augmented frames, for example by pasting lots of generated objects onto existing backgrounds. Both perspectives pay little attention to the locations of objects in the scene. Frame layouts are either reused with little or no modification, or they are random and disregard realism entirely. In this work, we argue that optimal data augmentation should also include realistic augmentation of layouts. We introduce a scene‑aware probabilistic location model that predicts where new objects can realistically be placed in an existing scene. By then inpainting objects in these locations with a generative model, we obtain much stronger augmentation performance than existing approaches. We set a new state of the art for generative data augmentation on two automotive object detection tasks, achieving up to 2.8× higher gains than the best competing approach (+1.4 vs. +0.5 mAP boost). We also demonstrate significant improvements for instance segmentation.

Abstract:
Semantic segmentation of remote sensing imagery demands precise spatial boundaries and robust intra‑class consistency, challenging conventional hierarchical models. To address limitations arising from spatial domain feature fusion and insufficient receptive fields, this paper introduces SAIP‑Net, a novel frequency‑aware segmentation framework that leverages Spectral Adaptive Information Propagation. SAIP‑Net employs adaptive frequency filtering and multi‑scale receptive field enhancement to effectively suppress intra‑class feature inconsistencies and sharpen boundary lines. Comprehensive experiments demonstrate significant performance improvements over state‑of‑the‑art methods, highlighting the effectiveness of spectral‑adaptive strategies combined with expanded receptive fields for remote sensing image segmentation.

Abstract:
We introduce ROAR (Robust Object Removal and Re‑annotation), a scalable framework for privacy‑preserving dataset obfuscation that eliminates sensitive objects instead of modifying them. Our method integrates instance segmentation with generative inpainting to remove identifiable entities while preserving scene integrity. Extensive evaluations on 2D COCO‑based object detection show that ROAR achieves 87.5% of the baseline detection average precision (AP), whereas image dropping achieves only 74.2% of the baseline AP, highlighting the advantage of scrubbing in preserving dataset utility. The degradation is even more severe for small objects due to occlusion and loss of fine‑grained details. Furthermore, in NeRF‑based 3D reconstruction, our method incurs a PSNR loss of at most 1.66 dB while maintaining SSIM and improving LPIPS, demonstrating superior perceptual quality. Our findings establish object removal as an effective privacy framework, achieving strong privacy guarantees with minimal performance trade‑offs. The results highlight key challenges in generative inpainting, occlusion‑robust segmentation, and task‑specific scrubbing, setting the foundation for future advancements in privacy‑preserving vision systems.

Abstract:
RGB‑Depth (RGB‑D) Video Object Segmentation (VOS) aims to integrate the fine‑grained texture information of RGB with the spatial geometric clues of the depth modality, boosting segmentation performance. However, off‑the‑shelf RGB‑D segmentation methods fail to fully explore cross‑modal information and suffer from object drift during long‑term prediction. In this paper, we propose a novel RGB‑D VOS method via multi‑store feature memory for robust segmentation. Specifically, we design a hierarchical modality selection and fusion scheme, which adaptively combines features from both modalities. Additionally, we develop a segmentation refinement module that effectively utilizes the Segment Anything Model (SAM) to refine the segmentation mask, ensuring more reliable results as memory to guide subsequent segmentation tasks. By leveraging spatio‑temporal embedding and modality embedding, mixed prompts and fused images are fed into SAM to unleash its potential in RGB‑D VOS. Experimental results show that the proposed method achieves state‑of‑the‑art performance on the latest RGB‑D VOS benchmark.

Abstract:
In recent years, the application of Deep Learning techniques has shown remarkable success in various computer vision tasks, paving the way for their deployment in extraterrestrial exploration. Transfer learning has emerged as a powerful strategy for addressing the scarcity of labeled data in these novel environments. This paper represents one of the first efforts in evaluating the feasibility of employing adapters toward efficient transfer learning for rock segmentation in extraterrestrial landscapes, mainly focusing on lunar and Martian terrains. Our work suggests that the use of adapters, strategically integrated into a pre‑trained backbone model, can be successful in reducing both bandwidth and memory requirements for the target extraterrestrial device. In this study, we considered two memory‑saving strategies: layer fusion (to reduce the inference overhead to zero) and an "adapter ranking" (to also reduce the transmission cost). Finally, we evaluate these results in terms of task performance, memory, and computation on embedded devices, evidencing trade‑offs that open the road to more research in the field.

Abstract:
Few‑shot semantic segmentation has attracted growing interest for its ability to generalize to novel object categories using only a few annotated samples. To address data scarcity, recent methods incorporate multiple foundation models to improve feature transferability and segmentation performance. However, they often rely on dual‑branch architectures that combine pre‑trained encoders to leverage complementary strengths, a design that limits flexibility and efficiency. This raises a fundamental question: can we build a unified model that integrates knowledge from different foundation architectures? Achieving this is, however, challenging due to the misalignment between class‑agnostic segmentation capabilities and fine‑grained discriminative representations. To this end, we present UINO‑FSS, a novel framework built on the key observation that early‑stage DINOv2 features exhibit distribution consistency with SAM's output embeddings. This consistency enables the integration of both models' knowledge into a single‑encoder architecture via coarse‑to‑fine multimodal distillation. In particular, our segmenter consists of three core components: a bottleneck adapter for embedding alignment, a meta‑visual prompt generator that leverages dense similarity volumes and semantic embeddings, and a mask decoder. Using hierarchical cross‑model distillation, we effectively transfer SAM's knowledge into the segmenter, further enhanced by Mamba‑based 4D correlation mining on support‑query pairs. Extensive experiments on PASCAL‑5^i and COCO‑20^i show that UINO‑FSS achieves new state‑of‑the‑art results under the 1‑shot setting, with mIoU of 80.6 (+3.8%) on PASCAL‑5^i and 64.5 (+4.1%) on COCO‑20^i, demonstrating the effectiveness of our unified approach.

Abstract:
Vision‑language models (VLMs) have demonstrated impressive zero‑shot transfer capabilities in image‑level visual perception tasks. However, they fall short in 3D instance‑level segmentation tasks that require accurate localization and recognition of individual objects. To bridge this gap, we introduce a novel 3D Gaussian Splatting based hard visual prompting approach that leverages camera interpolation to generate diverse viewpoints around target objects without any 2D‑3D optimization or fine‑tuning. Our method simulates realistic 3D perspectives, effectively augmenting existing hard visual prompts by enforcing geometric consistency across viewpoints. This training‑free strategy seamlessly integrates with prior hard visual prompts, enriching object‑descriptive features and enabling VLMs to achieve more robust and accurate 3D instance segmentation in diverse 3D scenes.

Abstract:
Vision Foundation Models (VFMs) have become a de facto choice for many downstream vision tasks, like image classification, image segmentation, and object localization. However, they can also provide significant utility for downstream 3D tasks that can leverage the cross‑modal information (e.g., from paired image data). In our work, we further explore the utility of VFMs for adapting from a labeled source to unlabeled target data for the task of LiDAR‑based 3D semantic segmentation. Our method consumes paired 2D‑3D (image and point cloud) data and relies on the robust (cross‑domain) features from a VFM to train a 3D backbone on a mix of labeled source and unlabeled target data. At the heart of our method lies a fusion network that is guided by both the image and point cloud streams, with their relative contributions adjusted based on the target domain. We extensively compare our proposed methodology with different state‑of‑the‑art methods in several settings and achieve strong performance gains. For example, we achieve an average improvement of 6.5 mIoU over all tasks compared with the previous state of the art.

Abstract:
Image‑based crack detection algorithms are increasingly in demand in infrastructure monitoring, as early detection of cracks is of paramount importance for timely maintenance planning. While deep learning has significantly advanced crack detection algorithms, existing models often require extensive labeled datasets and high computational costs for fine‑tuning, limiting their adaptability across diverse conditions. This study introduces an efficient selective fine‑tuning strategy, focusing on tuning normalization components, to enhance the adaptability of segmentation models for crack detection. The proposed method is applied to the Segment Anything Model (SAM) and five well‑established segmentation models. Experimental results demonstrate that selective fine‑tuning of only normalization parameters outperforms full fine‑tuning and other common fine‑tuning techniques in both performance and computational efficiency, while improving generalization. The proposed approach yields a SAM‑based model, Segment Any Crack (SAC), achieving a 61.22% F1‑score and 44.13% IoU on the OmniCrack30k benchmark dataset, along with the highest performance across three zero‑shot datasets and the lowest standard deviation. The results highlight the effectiveness of the adaptation approach in improving segmentation accuracy while significantly reducing computational overhead.

Abstract:
Road environment segmentation plays a significant role in autonomous driving. Numerous works based on Fully Convolutional Networks (FCNs) and Transformer architectures have been proposed to leverage local and global contextual learning for efficient and accurate semantic segmentation. In both architectures, the encoder often relies heavily on extracting continuous representations from the image, which limits the ability to represent meaningful discrete information. To address this limitation, we propose segmentation of the autonomous driving environment using vector quantization. Vector quantization offers three primary advantages for road environment segmentation. (1) Each continuous feature from the encoder is mapped to a discrete vector from the codebook, helping the model discover distinct features more easily than with complex continuous features. (2) Since discrete features act as compressed versions of the encoder's continuous features, they also compress noise or outliers, enhancing the image segmentation task. (3) Vector quantization encourages the latent space to form coarse clusters of continuous features, forcing the model to group similar features, making the learned representations more structured for the decoding process. In this work, we combined vector quantization with the lightweight image segmentation model MobileUNETR and used it as a baseline model for comparison to demonstrate its efficiency. Through experiments, we achieved 77.0% mIoU on Cityscapes, outperforming the baseline by 2.9% without increasing the model's initial size or complexity.
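
The codebook lookup described in point (1) can be illustrated with a nearest-neighbour search under squared Euclidean distance. This is a minimal sketch of generic vector quantization, not the paper's implementation; the function `quantize`, the two-dimensional codebook, and the feature values are assumptions chosen for readability.

```python
def quantize(features, codebook):
    """Map each feature vector to the index and value of its nearest codebook entry."""
    indices, quantized = [], []
    for f in features:
        # Squared L2 distance from this feature to every codebook vector.
        dists = [sum((a - b) ** 2 for a, b in zip(f, code)) for code in codebook]
        k = min(range(len(codebook)), key=dists.__getitem__)
        indices.append(k)
        quantized.append(codebook[k])
    return indices, quantized


# Hypothetical 2-D codebook with three entries and four encoder features.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
features = [[0.1, -0.2], [0.9, 1.2], [3.5, 0.4], [1.1, 0.8]]
indices, quantized = quantize(features, codebook)
```

In a trained model the codebook is learned jointly with the encoder; the sketch covers only the lookup step, in which nearby continuous features collapse onto the same discrete entry.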

Abstract:
Standard semantic instance segmentation provides useful, but inherently 2D information from a single image. To enable 3D analysis, one usually integrates absolute monocular depth estimation with instance segmentation. However, monocular depth estimation is a difficult task. Instead, we leverage a simpler single‑image task, occlusion‑based relative depth ordering, providing coarser but useful 3D information. We show that relative depth ordering works more reliably from occlusions than from absolute depth. We propose to solve the joint task of relative depth ordering and segmentation of instances based on occlusions. We call this task Occlusion‑Ordered Semantic Instance Segmentation (OOSIS). We develop an approach to OOSIS that extracts instances and their occlusion order simultaneously from oriented occlusion boundaries and semantic segmentation. Unlike the popular detect‑and‑segment framework for instance segmentation, combining occlusion ordering with instance segmentation allows a simple and clean formulation of OOSIS as a labeling problem. As a part of our solution for OOSIS, we develop a novel oriented occlusion boundaries approach that significantly outperforms prior work. We also develop a new joint OOSIS metric based both on instance mask accuracy and correctness of their occlusion order. We achieve better performance than strong baselines on the KINS and COCOA datasets.

Abstract:
Referring video object segmentation (RVOS) aims to segment objects in videos guided by natural language descriptions. We propose FS‑RVOS, a Transformer‑based model with two key components: a cross‑modal affinity module and an instance sequence matching strategy, which extends FS‑RVOS to multi‑object segmentation (FS‑RVMOS). Experiments show FS‑RVOS and FS‑RVMOS outperform state‑of‑the‑art methods across diverse benchmarks, demonstrating superior robustness and accuracy.

Abstract:
We propose an enhanced deep learning‑based model for image segmentation of the left and right ventricles and myocardium scar tissue from cardiac magnetic resonance (CMR) images. The proposed technique integrates UNet, channel and spatial attention, edge‑detection based skip‑connection and deep supervised learning to improve the accuracy of the CMR image‑segmentation. Images are processed using multiple channels to generate multiple feature‑maps. We built a dual attention‑based model to integrate channel and spatial attention. The use of extracted edges in skip connection improves the reconstructed images from feature‑maps. The use of deep supervision reduces vanishing gradient problems inherent in classification based on deep neural networks. The algorithms for dual attention‑based model, corresponding implementation and performance results are described. The performance results show that this approach has attained high accuracy: 98% Dice Similarity Score (DSC) and significantly lower Hausdorff Distance (HD). The performance results outperform other leading techniques both in DSC and HD.
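
The two metrics reported above can be sketched on masks represented as sets of pixel coordinates. This is an illustrative implementation under simplifying assumptions: distances are in pixels rather than millimetres, the Hausdorff distance is taken over all mask pixels rather than boundary points only, and the names `dice`, `hausdorff`, `pred`, and `truth` are hypothetical.

```python
import math


def dice(a, b):
    """Dice similarity: 2|A ∩ B| / (|A| + |B|); defined as 1.0 when both masks are empty."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    def directed(src, dst):
        # Largest distance from a point in src to its nearest point in dst.
        return max(min(math.dist(p, q) for q in dst) for p in src)
    return max(directed(a, b), directed(b, a))


# Two overlapping 2x2 masks, shifted by one column (toy data).
pred = {(0, 0), (0, 1), (1, 0), (1, 1)}
truth = {(0, 1), (1, 1), (0, 2), (1, 2)}
dsc = dice(pred, truth)
hd = hausdorff(pred, truth)
```

Dice rewards overlap while the Hausdorff distance penalises the single worst boundary error, which is why the two are commonly reported together.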

Abstract:
Automated noninvasive cardiac diagnosis plays a critical role in the early detection of cardiac disorders and cost‑effective clinical management. Automated diagnosis involves the automated segmentation and analysis of cardiac images. Precise delineation of cardiac substructures and extraction of their morphological attributes are essential for evaluating the cardiac function, and diagnosing cardiovascular disease such as cardiomyopathy, valvular diseases, abnormalities related to septum perforations, and blood‑flow rate. Semantic segmentation labels the CMR image at the pixel level, and localizes its subcomponents to facilitate the detection of abnormalities, including abnormalities in cardiac wall motion in an aging heart with muscle abnormalities, vascular abnormalities, and valvular abnormalities. In this paper, we describe a model to improve semantic segmentation of CMR images. The model extracts edge‑attributes and context information during down‑sampling of the U‑Net and infuses this information during up‑sampling to localize three major cardiac structures: left ventricle cavity (LV); right ventricle cavity (RV); and LV myocardium (LMyo). We present an algorithm and performance results. A comparison of our model with previous leading models, using similarity metrics between actual image and segmented image, shows that our approach improves Dice similarity coefficient (DSC) by 2%‑11% and lowers Hausdorff distance (HD) by 1.6 to 5.7 mm.

Abstract:
This survey examines recent advances in generating digital twins from visual data. These digital twins ‑ virtual 3D replicas of physical assets ‑ can be applied to robotics, media content creation, design or construction workflows. We analyze a range of approaches, including 3D Gaussian Splatting, generative inpainting, semantic segmentation, and foundation models, highlighting their respective advantages and limitations. In addition, we discuss key challenges such as occlusions, lighting variations, and scalability, as well as identify gaps, trends, and directions for future research. Overall, this survey aims to provide a comprehensive overview of state‑of‑the‑art methodologies and their implications for real‑world applications. Awesome Digital Twin: https://awesomedigitaltwin.github.io

Abstract:
Generative Adversarial Network (GAN) inversion has demonstrated excellent performance in image inpainting, which aims to restore lost or damaged image texture using the unmasked content. Previous GAN inversion‑based methods usually utilize well‑trained GAN models as effective priors to generate realistic regions for missing holes. Despite their excellence, they ignore a hard constraint that the unmasked regions in the input and the output should be the same, resulting in a gap between GAN inversion and image inpainting and thus degrading the performance. Besides, existing GAN inversion approaches often consider a single modality of the input image, neglecting other auxiliary cues in images for improvements. Addressing these problems, we propose a novel GAN inversion approach, dubbed MMInvertFill, for image inpainting. MMInvertFill contains primarily a multimodal guided encoder with a pre‑modulation and a GAN generator with F&W+ latent space. Specifically, the multimodal encoder aims to enhance the multi‑scale structures with additional semantic segmentation edge texture modalities through a gated mask‑aware attention module. Afterwards, a pre‑modulation is presented to encode these structures into style vectors. To mitigate issues of conspicuous color discrepancy and semantic inconsistency, we introduce the F&W+ latent space to bridge the gap between GAN inversion and image inpainting. Furthermore, in order to reconstruct faithful and photorealistic images, we devise a simple yet effective Soft‑update Mean Latent module to capture more diversified in‑domain patterns for generating high‑fidelity textures for massive corruptions. In our extensive experiments on six challenging datasets, we show that our MMInvertFill qualitatively and quantitatively outperforms other state‑of‑the‑art methods and supports the completion of out‑of‑domain images effectively.

Abstract:
Labeling has always been expensive in the medical context, which has hindered related deep learning applications. Our work introduces active learning in surgical video frame selection to construct a high‑quality, affordable Laparoscopic Cholecystectomy dataset for semantic segmentation. Active learning allows the Deep Neural Networks (DNNs) learning pipeline to include the dataset construction workflow, which means DNNs trained on the existing dataset will identify the most informative data from the newly collected data. At the same time, DNNs' performance and generalization ability improve over time when the newly selected and annotated data are included in the training data. We assessed different data informativeness measurements and found that deep feature distances select the most informative data in this task. Our experiments show that with half of the data selected by active learning, the DNNs achieve almost the same performance, with 0.4349 mean Intersection over Union (mIoU) compared to the same DNNs trained on the full dataset (0.4374 mIoU), on the critical anatomies and surgical instruments.
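
One common way to turn feature distances into a selection rule is greedy farthest-first (k-center) sampling, sketched below. The abstract does not state which rule the authors use, so this technique is swapped in purely for illustration; the function `select_by_feature_distance` and the two-dimensional feature vectors are assumptions.

```python
import math


def select_by_feature_distance(labeled, pool, budget):
    """Greedy farthest-first: repeatedly pick the pool item farthest from all chosen items."""
    # Distance from each pool item to its nearest already-labeled feature.
    min_dist = [min(math.dist(x, l) for l in labeled) for x in pool]
    chosen = []
    for _ in range(min(budget, len(pool))):
        i = max(range(len(pool)), key=min_dist.__getitem__)
        chosen.append(i)
        # The newly chosen item now counts as covered territory.
        min_dist = [min(d, math.dist(x, pool[i])) for d, x in zip(min_dist, pool)]
    return chosen


# Hypothetical deep features: one labeled frame and four unlabeled candidates.
labeled = [[0.0, 0.0]]
pool = [[0.1, 0.0], [5.0, 0.0], [5.1, 0.0], [0.0, 3.0]]
chosen = select_by_feature_distance(labeled, pool, budget=2)
```

Here the second pick skips the near-duplicate at `[5.0, 0.0]` in favour of `[0.0, 3.0]`, showing how distance-based selection avoids spending annotation budget on redundant frames.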

Abstract:
The operating room (OR) is a complex environment where optimizing workflows is critical to reduce costs and improve patient outcomes. While computer vision approaches for automatic recognition of perioperative events can identify bottlenecks for OR optimization, privacy concerns limit the use of OR videos for automated event detection. We propose a two‑stage pipeline for privacy‑preserving OR video analysis and event detection. First, we leverage vision foundation models for depth estimation and semantic segmentation to generate de‑identified Digital Twins (DT) of the OR from conventional RGB videos. Second, we employ the SafeOR model, a fused two‑stream approach that processes segmentation masks and depth maps for OR event detection. Evaluation on an internal dataset of 38 simulated surgical trials with five event classes shows that our DT‑based approach achieves performance on par with ‑‑ and sometimes better than ‑‑ raw RGB video‑based models for OR event detection. Digital Twins enable privacy‑preserving OR workflow analysis, facilitating the sharing of de‑identified data across institutions and potentially enhancing model generalizability by mitigating domain‑specific appearance differences.

Abstract:
Identifying spatial regions where biodiversity is threatened is crucial for effective ecosystem conservation and monitoring. In this study, we assessed various machine learning methods to detect grazing trails automatically. We tested five semantic segmentation models combined with 14 different encoder networks. The best combination was UNet with a MambaOut encoder. The proposed solution could be used as the basis for tools aiming at mapping and tracking changes in grazing trails on a continuous temporal basis.

Abstract:
Biomedical images often contain objects known to be spatially correlated or nested due to their inherent properties, leading to semantic relations. Examples include cell nuclei being nested within eukaryotic cells and colonies growing exclusively within their culture dishes. While these semantic relations bear key importance, detection tasks are often formulated independently, requiring multi‑shot analysis pipelines. Importantly, spatial correlation could constitute a fundamental prior facilitating learning of more meaningful representations for tasks like instance segmentation. This knowledge has, thus far, not been utilised by the biomedical computer vision community. We argue that the instance segmentation of two or more categories of objects can be achieved in parallel. We achieve this via two architectures, HydraStarDist (HSD) and the novel HSD‑WBR, both based on the widely‑used StarDist (SD), to take advantage of the star‑convexity of our target objects. HSD and HSD‑WBR are constructed to take the interactions between objects into account as constraints. HSD implicitly incorporates spatial correlation priors based on object interaction through a joint encoder. HSD‑WBR further enforces the prior in a regularisation layer with our proposed penalty, the Within Boundary Regularisation Penalty (WBR). Both architectures achieve nested instance segmentation in a single shot. We demonstrate their competitiveness based on IoU_R and AP, and their superiority on a new, task‑relevant criterion, Joint TP rate (JTPR), compared to their baseline SD and Cellpose. Our approach can be further modified to capture partial‑inclusion/‑exclusion in multi‑object interactions in fluorescent or brightfield microscopy or digital imaging. Finally, our strategy suggests gains by making this learning single‑shot and computationally efficient.

Abstract:
Computer vision models have seen increased usage in sports, and reinforcement learning (RL) is famous for beating humans in strategic games such as Chess and Go. In this paper, we are interested in building upon these advances and examining the game of classic 8‑ball pool. We introduce pix2pockets, a foundation for an RL‑assisted pool coach. Given a single image of a pool table, we first aim to detect the table and the balls and then propose the optimal shot suggestion. For the first task, we build a dataset with 195 diverse images where we manually annotate all balls and table dots, leading to 5748 object segmentation masks. For the second task, we build a standardized RL environment that allows easy development and benchmarking of any RL algorithm. Our object detection model yields an AP50 of 91.2 while our ball location pipeline obtains an error of only 0.4 cm. Furthermore, we compare standard RL algorithms to set a baseline for the shot suggestion task and we show that all of them fail to pocket all balls without making a foul move. We also present a simple baseline that achieves a per‑shot success rate of 94.7% and clears a full game in a single turn 30% of the time.

Abstract:
Open‑vocabulary 3D scene understanding is crucial for applications requiring natural language‑driven spatial interpretation, such as robotics and augmented reality. While 3D Gaussian Splatting (3DGS) offers a powerful representation for scene reconstruction, integrating it with open‑vocabulary frameworks reveals a key challenge: cross‑view granularity inconsistency. This issue, stemming from 2D segmentation methods like SAM, results in inconsistent object segmentations across views (e.g., a "coffee set" segmented as a single entity in one view but as "cup + coffee + spoon" in another). Existing 3DGS‑based methods often rely on isolated per‑Gaussian feature learning, neglecting the spatial context needed for cohesive object reasoning, leading to fragmented representations. We propose Context‑Aware Gaussian Splatting (CAGS), a novel framework that addresses this challenge by incorporating spatial context into 3DGS. CAGS constructs local graphs to propagate contextual features across Gaussians, reducing noise from inconsistent granularity, employs mask‑centric contrastive learning to smooth SAM‑derived features across views, and leverages a precomputation strategy to reduce computational cost by precomputing neighborhood relationships, enabling efficient training in large‑scale scenes. By integrating spatial context, CAGS significantly improves 3D instance segmentation and reduces fragmentation errors on datasets like LERF‑OVS and ScanNet, enabling robust language‑guided 3D scene understanding.

Abstract:
Existing computer vision (CV)‑based structural damage identification models demonstrate notable accuracy in categorizing and localizing damage. However, these models present several critical limitations that hinder their practical application in civil engineering (CE). Primarily, their ability to recognize damage types remains constrained, preventing comprehensive analysis of the highly varied and complex conditions encountered in real‑world CE structures. Second, these models lack linguistic capabilities, rendering them unable to articulate structural damage characteristics through natural language descriptions. With the continuous advancement of artificial intelligence (AI), large multi‑modal models (LMMs) have emerged as a transformative solution, enabling the unified encoding and alignment of textual and visual data. These models can autonomously generate detailed descriptive narratives of structural damage while demonstrating robust generalization across diverse scenarios and tasks. This study introduces SDIGLM, an innovative LMM for structural damage identification, developed based on the open‑source VisualGLM‑6B architecture. To address the challenge of adapting LMMs to the intricate and varied operating conditions in CE, this work integrates a U‑Net‑based semantic segmentation module to generate defect segmentation maps as visual Chain of Thought (CoT). Additionally, a multi‑round dialogue fine‑tuning dataset is constructed to enhance logical reasoning, complemented by a language CoT formed through prompt engineering. By leveraging this multi‑modal CoT, SDIGLM surpasses general‑purpose LMMs in structural damage identification, achieving an accuracy of 95.24% across various infrastructure types. Moreover, the model effectively describes damage characteristics such as hole size, crack direction, and corrosion severity.

Abstract:
This paper tackles category‑level pose estimation of articulated objects in robotic manipulation tasks and introduces a new benchmark dataset. While recent methods estimate part poses and sizes at the category level, they often rely on geometric cues and complex multi‑stage pipelines that first segment parts from the point cloud, followed by Normalized Part Coordinate Space (NPCS) estimation for 6D poses. These approaches overlook dense semantic cues from RGB images, leading to suboptimal accuracy, particularly for objects with small parts. To address these limitations, we propose a single‑stage Network, CAP‑Net, for estimating the 6D poses and sizes of Categorical Articulated Parts. This method combines RGB‑D features to generate instance segmentation and NPCS representations for each part in an end‑to‑end manner. CAP‑Net uses a unified network to simultaneously predict point‑wise class labels, centroid offsets, and NPCS maps. A clustering algorithm then groups points of the same predicted class based on their estimated centroid distances to isolate each part. Finally, the NPCS region of each part is aligned with the point cloud to recover its final pose and size. To bridge the sim‑to‑real domain gap, we introduce the RGBD‑Art dataset, the largest RGB‑D articulated dataset to date, featuring photorealistic RGB images and depth noise simulated from real sensors. Experimental evaluations on the RGBD‑Art dataset demonstrate that our method significantly outperforms the state‑of‑the‑art approach. Real‑world deployments of our model in robotic tasks underscore its robustness and exceptional sim‑to‑real transfer capabilities, confirming its substantial practical utility. Our dataset, code and pre‑trained models are available on the project page.
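
The grouping step described above (points of the same predicted class are clustered by their estimated centroids) can be illustrated with a minimal sketch. It is not CAP‑Net's actual clustering algorithm, which the abstract does not specify: the greedy assignment rule, the function name `cluster_parts`, and the `radius` threshold are all assumptions made for illustration.

```python
from math import dist


def cluster_parts(points, offsets, labels, radius=0.1):
    """Greedily group points into part instances.

    points, offsets: lists of (x, y, z) tuples; labels: predicted class per point.
    Each point votes for a centroid (point + predicted offset). A vote joins the
    first existing cluster of the same class whose mean vote lies within `radius`
    (a hypothetical threshold); otherwise it starts a new cluster.
    Returns one instance id per point.
    """
    clusters = []  # each entry: {"label", "sum" of votes, "count"}
    ids = []
    for point, offset, label in zip(points, offsets, labels):
        vote = tuple(p + o for p, o in zip(point, offset))
        assigned = None
        for idx, cluster in enumerate(clusters):
            if cluster["label"] != label:
                continue  # never merge points of different predicted classes
            mean = tuple(s / cluster["count"] for s in cluster["sum"])
            if dist(vote, mean) <= radius:
                assigned = idx
                break
        if assigned is None:
            clusters.append({"label": label, "sum": list(vote), "count": 1})
            assigned = len(clusters) - 1
        else:
            cluster = clusters[assigned]
            cluster["sum"] = [s + v for s, v in zip(cluster["sum"], vote)]
            cluster["count"] += 1
        ids.append(assigned)
    return ids
```

In this sketch, two drawers of the same class end up in separate instances as long as their voted centroids are farther apart than `radius`.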

Abstract:
Deep learning techniques have achieved remarkable success in the semantic segmentation of remote sensing images and in land‑use change detection. Nevertheless, their real‑time deployment on edge platforms remains constrained by decoder complexity. Herein, we introduce LightFormer, a lightweight decoder for time‑critical tasks that involve unstructured targets, such as disaster assessment, unmanned aerial vehicle search‑and‑rescue, and cultural heritage monitoring. LightFormer employs a feature‑fusion and refinement module built on channel processing and a learnable gating mechanism to aggregate multi‑scale, multi‑range information efficiently, which drastically curtails model complexity. Furthermore, we propose a spatial information selection module (SISM) that integrates long‑range attention with a detail preservation branch to capture spatial dependencies across multiple scales, thereby substantially improving the recognition of unstructured targets in complex scenes. On the ISPRS Vaihingen benchmark, LightFormer attains 99.9% of GLFFNet's mIoU (83.9% vs. 84.0%) while requiring only 14.7% of its FLOPs and 15.9% of its parameters, thus achieving an excellent accuracy‑efficiency trade‑off. Consistent results on LoveDA, ISPRS Potsdam, RescueNet, and FloodNet further demonstrate its robustness and superior perception of unstructured objects. These findings highlight LightFormer as a practical solution for remote sensing applications where both computational economy and high‑precision segmentation are imperative.

Abstract:
Remote sensing images are widely utilized in many disciplines such as feature recognition and scene semantic segmentation. However, due to environmental factors and the issues of the imaging system, the image quality is often degraded which may impair subsequent visual tasks. Even though denoising remote sensing images plays an essential role before applications, the current denoising algorithms fail to attain optimum performance since these images possess complex features in the texture. Denoising frameworks based on artificial neural networks have shown better performance; however, they require exhaustive training with heterogeneous samples that extensively consume resources like power, memory, computation, and latency. Thus, here we present a computationally efficient and robust remote sensing image denoising method that doesn't require additional training samples. This method partitions patches of a remote‑sensing image in which a low‑rank manifold, representing the noise‑free version of the image, underlies the patch space. An efficient and robust approach to revealing this manifold is a randomized approximation of the singular value spectrum of the geodesics' Gramian matrix of the patch space. The method asserts a unique emphasis on each color channel during denoising so the three denoised channels are merged to produce the final image.

Abstract:
Posidonia oceanica is a seagrass species whose meadows depend heavily on rocks for their survival and conservation. In recent years, there has been a concerning global decline in this species, emphasizing the critical need for efficient monitoring and assessment tools. While deep learning‑based semantic segmentation and visual automated monitoring systems have shown promise in a variety of applications, their performance in underwater environments remains challenging due to complex water conditions and limited datasets. This paper introduces a framework that combines machine learning and computer vision techniques to enable an autonomous underwater vehicle (AUV) to inspect the boundaries of Posidonia oceanica meadows autonomously. The framework incorporates an image segmentation module using an existing Mask R‑CNN model and a strategy for Posidonia oceanica meadow boundary tracking. Furthermore, a new class dedicated to rocks is introduced to enhance the existing model, aiming to contribute to a comprehensive monitoring approach and provide a deeper understanding of the intricate interactions between the meadow and its surrounding environment. The image segmentation model is validated using real underwater images, while the overall inspection framework is evaluated in a realistic simulation environment, replicating actual monitoring scenarios with real underwater images. The results demonstrate that the proposed framework enables the AUV to autonomously accomplish the main tasks of underwater inspection and segmentation of rocks. Consequently, this work holds significant potential for the conservation and protection of marine environments, providing valuable insights into the status of Posidonia oceanica meadows and supporting targeted preservation efforts.

Abstract:
Complex video object segmentation continues to face significant challenges in small object recognition, occlusion handling, and dynamic scene modeling. This report presents our solution, which ranked second in the MOSE track of CVPR 2025 PVUW Challenge. Based on an existing segmentation framework, we propose an improved model named MASSeg for complex video object segmentation, and construct an enhanced dataset, MOSE+, which includes typical scenarios with occlusions, cluttered backgrounds, and small target instances. During training, we incorporate a combination of inter‑frame consistent and inconsistent data augmentation strategies to improve robustness and generalization. During inference, we design a mask output scaling strategy to better adapt to varying object sizes and occlusion levels. As a result, MASSeg achieves a J score of 0.8250, F score of 0.9007, and a J&F score of 0.8628 on the MOSE test set.
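
For readers unfamiliar with the metrics, J is region similarity (mask intersection‑over‑union) and J&F is conventionally the mean of J and the boundary measure F. The sketch below shows J on toy binary masks and the J&F average; the function names are hypothetical, and treating two empty masks as a perfect match is a convention assumed here. The boundary measure F is not implemented.

```python
def region_similarity(pred, gt):
    """J: intersection-over-union of two binary masks (nested lists of 0/1)."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j, f):
    """J&F: arithmetic mean of region similarity J and boundary accuracy F."""
    return (j + f) / 2.0
```

Averaging the reported J = 0.8250 and F = 0.9007 gives 0.86285, which agrees with the reported J&F of 0.8628 up to rounding.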

Abstract:
Road damage can create safety and comfort challenges for both human drivers and autonomous vehicles (AVs). This damage is particularly prevalent in rural areas due to less frequent surveying and maintenance of roads. Automated detection of pavement deterioration can be used as an input to AVs and driver assistance systems to improve road safety. Current research in this field has predominantly focused on urban environments driven largely by public datasets, while rural areas have received significantly less attention. This paper introduces M2S‑RoAD, a dataset for the semantic segmentation of different classes of road damage. M2S‑RoAD was collected in various towns across New South Wales, Australia, and labelled for semantic segmentation to identify nine distinct types of road damage. This dataset will be released upon the acceptance of the paper.

Abstract:
Unsupervised Domain Adaptation (UDA) is essential for enabling semantic segmentation in new domains without requiring costly pixel‑wise annotations. State‑of‑the‑art (SOTA) UDA methods primarily use self‑training with architecturally identical teacher and student networks, relying on Exponential Moving Average (EMA) updates. However, these approaches face substantial performance degradation with lightweight models due to inherent architectural inflexibility leading to low‑quality pseudo‑labels. To address this, we propose Distilled Unsupervised Domain Adaptation (DUDA), a novel framework that combines EMA‑based self‑training with knowledge distillation (KD). Our method employs an auxiliary student network to bridge the architectural gap between heavyweight and lightweight models for EMA‑based updates, resulting in improved pseudo‑label quality. DUDA employs a strategic fusion of UDA and KD, incorporating innovative elements such as gradual distillation from large to small networks, inconsistency loss prioritizing poorly adapted classes, and learning with multiple teachers. Extensive experiments across four UDA benchmarks demonstrate DUDA's superiority in achieving SOTA performance with lightweight models, often surpassing the performance of heavyweight models from other approaches.
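
The EMA teacher update that these self‑training methods rely on can be sketched as follows. This is a generic illustration rather than DUDA's code: parameters are modelled as a plain dictionary of floats, and the name `ema_update` and the default `decay` value are assumptions.

```python
def ema_update(teacher, student, decay=0.999):
    """Return teacher weights moved slightly toward the student weights.

    teacher, student: dicts mapping parameter names to floats. Both must share
    the same keys, which mirrors the requirement that teacher and student be
    architecturally identical for an EMA update to be defined.
    """
    if teacher.keys() != student.keys():
        raise ValueError("EMA needs identical parameter sets in teacher and student")
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }
```

The key check makes the architectural constraint explicit: a lightweight student cannot update a heavyweight teacher by EMA directly, which is the gap the auxiliary student network is described as bridging.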

Abstract:
Semi‑Supervised Semantic Segmentation (SSSS) aims to improve segmentation accuracy by leveraging a small set of labeled images alongside a larger pool of unlabeled data. Recent advances primarily focus on pseudo‑labeling, consistency regularization, and co‑training strategies. However, existing methods struggle to balance global semantic representation with fine‑grained local feature extraction. To address this challenge, we propose a novel tri‑branch semi‑supervised segmentation framework incorporating a dual‑teacher strategy, named IGL‑DT. Our approach employs SwinUnet for high‑level semantic guidance through Global Context Learning and ResUnet for detailed feature refinement via Local Regional Learning. Additionally, a Discrepancy Learning mechanism mitigates over‑reliance on a single teacher, promoting adaptive feature learning. Extensive experiments on benchmark datasets demonstrate that our method outperforms state‑of‑the‑art approaches, achieving superior segmentation performance across various data regimes.

Abstract:
Radio Frequency Interference (RFI) from anthropogenic radio sources poses significant challenges to current and future radio telescopes. Contemporary approaches to detecting RFI treat the task as a semantic segmentation problem on radio telescope spectrograms. Typically, complex heuristic algorithms handle this task of 'flagging' in combination with manual labeling (in the most difficult cases). While recent machine‑learning approaches have demonstrated high accuracy, they often fail to meet the stringent operational requirements of modern radio observatories. Owing to their inherently time‑varying nature, spiking neural networks (SNNs) are a promising alternative for RFI‑detection, as they can exploit the time‑varying structure of the spectrographic source data. In this work, we apply Liquid State Machines (LSMs), a class of spiking neural networks, to RFI‑detection. We employ second‑order Leaky Integrate‑and‑Fire (LiF) neurons, marking the first use of this architecture and neuron type for RFI‑detection. We test three encoding methods and three increasingly complex readout layers, including a transformer decoder head, providing a hybrid of SNN and ANN techniques. Our methods extend LSMs beyond conventional classification tasks to fine‑grained spatio‑temporal segmentation. We train LSMs on simulated data derived from the Hydrogen Epoch of Reionization Array (HERA), a known benchmark for RFI‑detection. Our model achieves a per‑pixel accuracy of 98% and an F1‑score of 0.743, demonstrating competitive performance on this highly challenging task. This work expands the sophistication of SNN techniques and architectures applied to RFI‑detection, and highlights the effectiveness of LSMs in handling fine‑grained, complex, spatio‑temporal signal‑processing tasks.

Abstract:
Video Object Segmentation (VOS) is one of the most fundamental and challenging tasks in computer vision and has a wide range of applications. Most existing methods rely on spatiotemporal memory networks to extract frame‑level features and have achieved promising results on commonly used datasets. However, these methods often struggle in more complex real‑world scenarios. This paper addresses this issue, aiming to achieve accurate segmentation of video objects in challenging scenes. We propose fine‑tuning VOS (FVOS), optimizing existing methods for specific datasets through tailored training. Additionally, we introduce a morphological post‑processing strategy to address the issue of excessively large gaps between adjacent objects in single‑model predictions. Finally, we apply a voting‑based fusion method on multi‑scale segmentation results to generate the final output. Our approach achieves J&F scores of 76.81% and 83.92% during the validation and testing stages, respectively, securing third place overall in the MOSE Track of the 4th PVUW challenge 2025.
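
A voting‑based fusion of multi‑scale predictions can be sketched as a pixel‑wise majority vote. The abstract does not state the exact voting rule, so the strict‑majority criterion below is an assumption, as is the premise that all masks have already been resized to a common resolution; `majority_vote` is a hypothetical name.

```python
def majority_vote(masks):
    """Fuse binary masks (nested lists of 0/1, all the same size) pixel by pixel.

    A pixel is foreground in the output if strictly more than half of the
    input masks mark it as foreground.
    """
    if not masks:
        raise ValueError("at least one mask is required")
    n = len(masks)
    height, width = len(masks[0]), len(masks[0][0])
    fused = []
    for y in range(height):
        row = []
        for x in range(width):
            votes = sum(mask[y][x] for mask in masks)
            row.append(1 if 2 * votes > n else 0)
        fused.append(row)
    return fused
```

With an even number of masks, ties resolve to background under this rule; a different tie‑break may suit other settings.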

Abstract:
Existing Masked Image Modeling methods apply fixed mask patterns to guide the self‑supervised training. As those mask patterns resort to different criteria to depict image contents, sticking to a fixed pattern leads to a limited capability for modeling visual cues. This paper introduces an evolved hierarchical masking method to pursue general visual cues modeling in self‑supervised learning. The proposed method leverages the vision model being trained to parse the input visual cues into a hierarchy structure, which is hence adopted to generate masks accordingly. The accuracy of hierarchy is on par with the capability of the model being trained, leading to evolved mask patterns at different training stages. Initially, generated masks focus on low‑level visual cues to grasp basic textures, then gradually evolve to depict higher‑level cues to reinforce the learning of more complicated object semantics and contexts. Our method does not require extra pre‑trained models or annotations and ensures training efficiency by evolving the training difficulty. We conduct extensive experiments on seven downstream tasks including partial‑duplicate image retrieval relying on low‑level details, as well as image classification and semantic segmentation that require semantic parsing capability. Experimental results demonstrate that it substantially boosts performance across these tasks. For instance, it surpasses the recent MAE by 1.1% in ImageNet‑1K classification and 1.4% in ADE20K segmentation with the same training epochs. We also align the proposed method with the current research focus on LLMs. The proposed approach bridges the gap with large‑scale pre‑training on semantic demanding tasks and enhances intricate detail perception in tasks requiring low‑level feature recognition.

Abstract:
Effective leveraging of real‑world driving datasets is crucial for enhancing the training of autonomous driving systems. While Offline Reinforcement Learning enables training autonomous vehicles with such data, most available datasets lack meaningful reward labels. Reward labeling is essential as it provides feedback for the learning algorithm to distinguish between desirable and undesirable behaviors, thereby improving policy performance. This paper presents a novel approach for generating human‑aligned reward labels. The proposed approach addresses the challenge of absent reward signals in the real‑world datasets by generating labels that reflect human judgment and safety considerations. The reward function incorporates an adaptive safety component that is activated by analyzing semantic segmentation maps, enabling the autonomous vehicle to prioritize safety over efficiency in potential collision scenarios. The proposed method is applied to an occluded pedestrian crossing scenario with varying pedestrian traffic levels, using simulation data. When the generated rewards were used to train various Offline Reinforcement Learning algorithms, each model produced a meaningful policy, demonstrating the method's viability. In addition, the method was applied to a subset of the Audi Autonomous Driving Dataset, and the reward labels were compared to human‑annotated reward labels. The findings show a moderate disparity between the two reward sets, and, most interestingly, the method flagged unsafe states that the human annotator missed.

Abstract:
Recent advancements in text‑to‑video (T2V) diffusion models have significantly enhanced the visual quality of the generated videos. However, even recent T2V models find it challenging to follow text descriptions accurately, especially when the prompt requires accurate control of spatial layouts or object trajectories. A recent line of research uses layout guidance for T2V models that require fine‑tuning or iterative manipulation of the attention map during inference time. This significantly increases the memory requirement, making it difficult to adopt a large T2V model as a backbone. To address this, we introduce Video‑MSG, a training‑free Guidance method for T2V generation based on Multimodal planning and Structured noise initialization. Video‑MSG consists of three steps, where in the first two steps, Video‑MSG creates Video Sketch, a fine‑grained spatio‑temporal plan for the final video, specifying background, foreground, and object trajectories, in the form of draft video frames. In the last step, Video‑MSG guides a downstream T2V diffusion model with Video Sketch through noise inversion and denoising. Notably, Video‑MSG does not need fine‑tuning or attention manipulation with additional memory during inference time, making it easier to adopt large T2V models. Video‑MSG demonstrates its effectiveness in enhancing text alignment with multiple T2V backbones (VideoCrafter2 and CogVideoX‑5B) on popular T2V generation benchmarks (T2VCompBench and VBench). We provide comprehensive ablation studies about noise inversion ratio, different background generators, background object detection, and foreground object segmentation.

Abstract:
Generating synthetic images is a useful method for cheaply obtaining labeled data for training computer vision models. However, obtaining accurate 3D models of relevant objects is necessary, and the resulting images often have a gap in realism due to challenges in simulating lighting effects and camera artifacts. We propose using the novel view synthesis method called Gaussian Splatting to address these challenges. We have developed a synthetic data pipeline for generating high‑quality context‑aware instance segmentation training data for specific objects. This process is fully automated, requiring only a video of the target object. We train a Gaussian Splatting model of the target object and automatically extract the object from the video. Leveraging Gaussian Splatting, we then render the object on a random background image, and monocular depth estimation is employed to place the object in a believable pose. We introduce a novel dataset to validate our approach and show superior performance over other data generation approaches, such as Cut‑and‑Paste and Diffusion model‑based generation.

Abstract:
Segmentation of video objects in complex scenarios is highly challenging, and the MOSE dataset has significantly contributed to the development of this field. This technical report details the STSeg solution proposed by the "imaplus" team. By finetuning SAM2 and the unsupervised model TMO on the MOSE dataset, the STSeg solution demonstrates remarkable advantages in handling complex object motions and long‑video sequences. In the inference phase, an Adaptive Pseudo‑labels Guided Model Refinement Pipeline is adopted to intelligently select appropriate models for processing each video. Through finetuning the models and employing the Adaptive Pseudo‑labels Guided Model Refinement Pipeline in the inference phase, the STSeg solution achieved a J&F score of 87.26% on the test set of the 2025 4th PVUW Challenge MOSE Track, securing the 1st place and advancing the technology for video object segmentation in complex scenarios.

Abstract:
We propose a novel framework for accurate 3D human pose estimation in combat sports using sparse multi‑camera setups. Our method integrates robust multi‑view 2D pose tracking via a transformer‑based top‑down approach, employing epipolar geometry constraints and long‑term video object segmentation for consistent identity tracking across views. Initial 3D poses are obtained through weighted triangulation and spline smoothing, followed by kinematic optimization to refine pose accuracy. We further enhance pose realism and robustness by introducing a multi‑person physics‑based trajectory optimization step, effectively addressing challenges such as rapid motions, occlusions, and close interactions. Experimental results on diverse datasets, including a new benchmark of elite boxing footage, demonstrate state‑of‑the‑art performance. Additionally, we release comprehensive annotated video datasets to advance future research in multi‑person pose estimation for combat sports.

Abstract:
Childlike human figure drawings represent one of humanity's most accessible forms of character expression, yet automatically analyzing their contents remains a significant challenge. While semantic segmentation of realistic humans has recently advanced considerably, existing models often fail when confronted with the abstract, representational nature of childlike drawings. This semantic understanding is a crucial prerequisite for animation tools that seek to modify figures while preserving their unique style. To help achieve this, we propose a novel hierarchical segmentation model, built upon the architecture and pre‑trained SAM, to quickly and accurately obtain these semantic labels. Our model achieves higher accuracy than state‑of‑the‑art segmentation models focused on realistic humans and cartoon figures, even after fine‑tuning. We demonstrate the value of our model for semantic segmentation through multiple applications: a fully automatic facial animation pipeline, a figure relighting pipeline, improvements to an existing childlike human figure drawing animation method, and generalization to out‑of‑domain figures. Finally, to support future work in this area, we introduce a dataset of 16,000 childlike drawings with pixel‑level annotations across 25 semantic categories. Our work can enable entirely new, easily accessible tools for hand‑drawn character animation, and our dataset can enable new lines of inquiry in a variety of graphics and human‑centric research fields.

Abstract:
Object recognition using single‑point supervision has attracted increasing attention recently. However, the performance gap compared with fully‑supervised algorithms remains large. Previous works generated class‑agnostic proposals in an image offline and then treated mixed candidates as a single bag, putting a huge burden on multiple instance learning (MIL). In this paper, we introduce Point‑to‑Box Network (P2BNet), which constructs balanced instance‑level proposal bags by generating proposals in an anchor‑like way and refining the proposals in a coarse‑to‑fine paradigm. Through further research, we find that the bag of proposals, either at the image level or the instance level, is established on discrete box sampling. This leads the pseudo box estimation into a sub‑optimal solution, resulting in the truncation of object boundaries or the excessive inclusion of background. Hence, we conduct a series of explorations of discrete‑to‑continuous optimization, yielding P2BNet++ and Point‑to‑Mask Network (P2MNet). P2BNet++ conducts an approximately continuous proposal sampling strategy by better utilizing spatial clues. P2MNet further introduces low‑level image information to assist in pixel prediction, and a boundary self‑prediction is designed to relieve the limitation of the estimated boxes. Benefiting from the continuous object‑aware pixel‑level perception, P2MNet can generate more precise bounding boxes and generalize to segmentation tasks. Our method largely surpasses the previous methods in terms of the mean average precision on COCO, VOC, SBD, and Cityscapes, demonstrating great potential to bridge the performance gap compared with fully supervised tasks.

Abstract:
Current knowledge distillation (KD) methods for semantic segmentation focus on guiding the student to imitate the teacher's knowledge within homogeneous architectures. However, these methods overlook the diverse knowledge contained in architectures with different inductive biases, which is crucial for enabling the student to acquire a more precise and comprehensive understanding of the data during distillation. To this end, we propose for the first time a generic knowledge distillation method for semantic segmentation from a heterogeneous perspective, named HeteroAKD. Due to the substantial disparities between heterogeneous architectures, such as CNN and Transformer, directly transferring cross‑architecture knowledge presents significant challenges. To eliminate the influence of architecture‑specific information, the intermediate features of both the teacher and student are skillfully projected into an aligned logits space. Furthermore, to utilize diverse knowledge from heterogeneous architectures and deliver customized knowledge required by the student, a teacher‑student knowledge mixing mechanism (KMM) and a teacher‑student knowledge evaluation mechanism (KEM) are introduced. These mechanisms are performed by assessing the reliability and its discrepancy between heterogeneous teacher‑student knowledge. Extensive experiments conducted on three main‑stream benchmarks using various teacher‑student pairs demonstrate that our HeteroAKD outperforms state‑of‑the‑art KD methods in facilitating distillation between heterogeneous architectures.

Abstract:
Open‑set semantic mapping is crucial for open‑world robots. Current mapping approaches either are limited by the depth range or only map beyond‑range entities in constrained settings, where overall they fail to combine within‑range and beyond‑range observations. Furthermore, these methods make a trade‑off between fine‑grained semantics and efficiency. We introduce RayFronts, a unified representation that enables both dense and beyond‑range efficient semantic mapping. RayFronts encodes task‑agnostic open‑set semantics to both in‑range voxels and beyond‑range rays encoded at map boundaries, empowering the robot to reduce search volumes significantly and make informed decisions both within & beyond sensory range, while running at 8.84 Hz on an Orin AGX. Benchmarking the within‑range semantics shows that RayFronts's fine‑grained image encoding provides 1.34x zero‑shot 3D semantic segmentation performance while improving throughput by 16.5x. Traditionally, online mapping performance is entangled with other system components, complicating evaluation. We propose a planner‑agnostic evaluation framework that captures the utility for online beyond‑range search and exploration, and show RayFronts reduces search volume 2.2x more efficiently than the closest online baselines.

Abstract:
Automated extraction of plant morphological traits is crucial for supporting crop breeding and agricultural management through high‑throughput field phenotyping (HTFP). Solutions based on multi‑view RGB images are attractive due to their scalability and affordability, enabling volumetric measurements that 2D approaches cannot directly capture. While advanced methods like Neural Radiance Fields (NeRFs) have shown promise, their application has been limited to counting or extracting traits from only a few plants or organs. Furthermore, accurately measuring complex structures like individual wheat heads, which is essential for studying crop yields, remains particularly challenging due to occlusions and the dense arrangement of crop canopies in field conditions. The recent development of 3D Gaussian Splatting (3DGS) offers a promising alternative for HTFP due to its high‑quality reconstructions and explicit point‑based representation. In this paper, we present Wheat3DGS, a novel approach that leverages 3DGS and the Segment Anything Model (SAM) for precise 3D instance segmentation and morphological measurement of hundreds of wheat heads automatically, representing the first application of 3DGS to HTFP. We validate the accuracy of wheat head extraction against high‑resolution laser scan data, obtaining per‑instance mean absolute percentage errors of 15.1%, 18.3%, and 40.2% for length, width, and volume. We provide additional comparisons to NeRF‑based approaches and traditional Multi‑View Stereo (MVS), demonstrating superior results. Our approach enables rapid, non‑destructive measurements of key yield‑related traits at scale, with significant implications for accelerating crop breeding and improving our understanding of wheat development.
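
The per‑instance mean absolute percentage error reported above can be illustrated with a minimal sketch. The function name and the sample values are hypothetical and are not taken from the paper; the sketch only assumes that each estimated trait is paired with a laser‑scan reference value.

```python
def mean_absolute_percentage_error(measured, reference):
    """Return the MAPE, in percent, between paired measurements."""
    if len(measured) != len(reference) or not measured:
        raise ValueError("need two non-empty sequences of equal length")
    total = 0.0
    for m, r in zip(measured, reference):
        if r == 0:
            raise ValueError("reference values must be non-zero")
        # Relative error of one instance against its reference value.
        total += abs(m - r) / abs(r)
    return 100.0 * total / len(measured)


# Hypothetical wheat-head lengths in millimetres (illustrative only).
estimated_lengths = [90.0, 110.0, 95.0]
laser_scan_lengths = [100.0, 100.0, 100.0]
mape = mean_absolute_percentage_error(estimated_lengths, laser_scan_lengths)
print(f"MAPE: {mape:.2f}%")
```

Because the error is averaged per instance, a few badly reconstructed heads can dominate the score, which may partly explain why volume is harder to match than length or width.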

Abstract:
In this paper, we propose a new evaluation metric called Domain Independence (DI) and a method called Attenuation of Domain‑Specific Information (ADSI), both specifically designed for domain‑generalized semantic segmentation in automotive images. DI measures the presence of domain‑specific information: a lower DI value indicates strong domain dependence, while a higher DI value suggests greater domain independence. This makes it possible to roughly identify where domain‑specific information exists and up to which frequency range it is present. As a result, it becomes possible to effectively suppress only the regions in the image that contain domain‑specific information, enabling feature extraction independent of the domain. ADSI uses a Butterworth filter to remove the low‑frequency components of images that contain inherent domain‑specific information such as sensor characteristics and lighting conditions. However, since low‑frequency components also contain important information such as color, we should not remove them completely. Thus, a scalar value (ranging from 0 to 1) is multiplied by the low‑frequency components to retain essential information. This helps the model learn more domain‑independent features. In experiments, GTA5 (a synthetic dataset) was used for training and a real‑world dataset was used for evaluation, and the proposed method outperformed conventional approaches. Similarly, in experiments where Cityscapes (a real‑world dataset) was used for training and datasets covering various conditions such as rain and nighttime were used for evaluation, the proposed method demonstrated its robustness under nighttime conditions.
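
One way to read the attenuation step is as a frequency‑domain gain that scales the low frequencies by the scalar and leaves the high frequencies untouched. The sketch below is an assumption about how such a gain could be built from a standard Butterworth low‑pass response; the function names, the cutoff, and the filter order are illustrative and are not the paper's exact formulation.

```python
def butterworth_lowpass(distance, cutoff, order=2):
    """Standard Butterworth low-pass response at a given frequency distance."""
    return 1.0 / (1.0 + (distance / cutoff) ** (2 * order))


def attenuation_gain(distance, cutoff, scale, order=2):
    """Gain that multiplies low frequencies by `scale` and keeps high ones.

    At distance 0 the gain equals `scale`; far above the cutoff it tends to 1.
    """
    if not 0.0 <= scale <= 1.0:
        raise ValueError("scale must lie in [0, 1]")
    return 1.0 - (1.0 - scale) * butterworth_lowpass(distance, cutoff, order)


# Hypothetical settings: cutoff radius 10, low frequencies kept at 30%.
for d in (0.0, 10.0, 100.0):
    print(d, round(attenuation_gain(d, cutoff=10.0, scale=0.3), 4))
```

In practice this gain would be evaluated at every position of the centred 2D spectrum of the image and multiplied in before the inverse transform; setting the scalar to 1 leaves the image unchanged, while 0 removes the low band entirely.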

Abstract:
Chronic wounds affect a large population, particularly the elderly and diabetic patients, who often exhibit limited mobility and co‑existing health conditions. Automated wound monitoring via mobile image capture can reduce in‑person physician visits by enabling remote tracking of wound size. Semantic segmentation is key to this process, yet wound segmentation remains underrepresented in medical imaging research. To address this, we benchmark state‑of‑the‑art deep learning models from general‑purpose vision, medical imaging, and top methods from public wound challenges. For a fair comparison, we standardize training, data augmentation, and evaluation, conducting cross‑validation to minimize partitioning bias. We also assess real‑world deployment aspects, including generalization to an out‑of‑distribution wound dataset, computational efficiency, and interpretability. Additionally, we propose a reference object‑based approach to convert AI‑generated masks into clinically relevant wound size estimates and evaluate this, along with mask quality, for the five best architectures based on physician assessments. Overall, the transformer‑based TransNeXt showed the highest levels of generalizability. Despite variations in inference times, all models processed at least one image per second on the CPU, which is deemed adequate for the intended application. Interpretability analysis typically revealed prominent activations in wound regions, emphasizing focus on clinically relevant features. Expert evaluation showed high mask approval for all analyzed models, with VWFormer and ConvNeXtS backbone performing the best. Size retrieval accuracy was similar across models, and predictions closely matched expert annotations. Finally, we demonstrate how our AI‑driven wound size estimation framework, WoundAmbit, is integrated into a custom telehealth system.

Abstract:
This paper focuses on the development and modification of a beehive monitoring device and on the detection of Varroa destructor on bees using hyperspectral imagery, a U‑Net semantic segmentation architecture, and conventional computer vision methods. The main objectives were to collect a dataset of bees and mites and to propose a computer vision model that can distinguish between bees and mites.

Abstract:
The primary focus of most recent works on open‑vocabulary neural fields is extracting precise semantic features from the VLMs and then consolidating them efficiently into a multi‑view consistent 3D neural field representation. However, most existing works over‑trust SAM to regularize image‑level CLIP without any further refinement. Moreover, several existing works improve efficiency by reducing the dimensionality of semantic features from 2D VLMs before fusing them with 3DGS semantic fields, which inevitably leads to multi‑view inconsistency. In this work, we propose econSG for open‑vocabulary semantic segmentation with 3DGS. Our econSG consists of: 1) A Confidence‑region Guided Regularization (CRR) that mutually refines SAM and CLIP to get the best of both worlds for precise semantic features with complete and precise boundaries. 2) A low‑dimensional contextual space that enforces 3D multi‑view consistency while improving computational efficiency by fusing backprojected multi‑view 2D features, followed by dimensionality reduction directly on the fused 3D features instead of operating on each 2D view separately. Our econSG shows state‑of‑the‑art performance on four benchmark datasets compared to the existing methods. Furthermore, our method is also the most efficient to train among all the methods.

Abstract:
Recent mainstream unsupervised video object segmentation (UVOS) motion‑appearance approaches use either the bi‑encoder structure to separately encode motion and appearance features, or the uni‑encoder structure for joint encoding. However, these methods fail to properly balance the motion‑appearance relationship. Consequently, even with complex fusion modules for motion‑appearance integration, the extracted suboptimal features degrade the models' overall performance. Moreover, the quality of optical flow varies across scenarios, making it insufficient to rely solely on optical flow to achieve high‑quality segmentation results. To address these challenges, we propose the Saliency‑Motion guided Trunk‑Collateral Network (SMTC‑Net), which better balances the motion‑appearance relationship and incorporates the model's intrinsic saliency information to enhance segmentation performance. Specifically, because optical flow maps are derived from RGB images, the two share both commonalities and differences. Accordingly, we propose a novel Trunk‑Collateral structure for motion‑appearance UVOS. The shared trunk backbone captures the motion‑appearance commonality, while the collateral branch learns the uniqueness of motion features. Furthermore, an Intrinsic Saliency guided Refinement Module (ISRM) is devised to efficiently leverage the model's intrinsic saliency information to refine high‑level features, and provide pixel‑level guidance for motion‑appearance fusion, thereby enhancing performance without additional input. Experimental results show that SMTC‑Net achieves state‑of‑the‑art performance on three UVOS datasets (89.2% J&F on DAVIS‑16, 76% J on YouTube‑Objects, 86.4% J on FBMS) and four standard video salient object detection (VSOD) benchmarks with notable improvements, demonstrating its effectiveness and superiority over previous methods.

Abstract:
3D semantic segmentation plays a critical role in urban modelling, enabling detailed understanding and mapping of city environments. In this paper, we introduce Turin3D: a new aerial LiDAR dataset for point cloud semantic segmentation covering an area of around 1.43 km2 in the city centre of Turin with almost 70M points. We describe the data collection process and compare Turin3D with others previously proposed in the literature. We did not fully annotate the dataset due to the complexity and time‑consuming nature of the process; however, a manual annotation process was performed on the validation and test sets, to enable a reliable evaluation of the proposed techniques. We first benchmark the performances of several point cloud semantic segmentation models, trained on the existing datasets, when tested on Turin3D, and then improve their performances by applying a semi‑supervised learning technique leveraging the unlabelled training set. The dataset will be publicly available to support research in outdoor point cloud segmentation, with particular relevance for self‑supervised and semi‑supervised learning approaches given the absence of ground truth annotations for the training set.

Abstract:
Recently, state space models (SSM), particularly Mamba, have attracted significant attention from scholars due to their ability to effectively balance computational efficiency and performance. However, most existing visual Mamba methods flatten images into 1D sequences using predefined scan orders, which results in the model being less capable of utilizing the spatial structural information of the image during the feature extraction process. To address this issue, we propose a novel visual foundation model called DefMamba. This model includes a multi‑scale backbone structure and deformable mamba (DM) blocks, which dynamically adjust the scanning path to prioritize important information, thus enhancing the capture and processing of relevant input features. By incorporating a deformable scanning (DS) strategy, this model significantly improves its ability to learn image structures and detect changes in object details. Numerous experiments have shown that DefMamba achieves state‑of‑the‑art performance in various visual tasks, including image classification, object detection, instance segmentation, and semantic segmentation. The code is open source on DefMamba.

Abstract:
Recent advances in Vision Transformers (ViTs) have significantly advanced semantic segmentation performance. However, their adaptation to new target domains remains challenged by distribution shifts, which often disrupt global attention mechanisms. While existing global and patch‑level adaptation methods offer some improvements, they overlook the spatially varying transferability inherent in different image regions. To address this, we propose the Transferable Mask Transformer (TMT), a region‑adaptive framework designed to enhance cross‑domain representation learning through transferability guidance. First, we dynamically partition the image into coherent regions, grouped by structural and semantic similarity, and estimate their domain transferability at a localized level. Then, we incorporate region‑level transferability maps directly into the self‑attention mechanism of ViTs, allowing the model to adaptively focus attention on areas with lower transferability and higher semantic uncertainty. Extensive experiments across 20 diverse cross‑domain settings demonstrate that TMT not only mitigates the performance degradation typically associated with domain shift but also consistently outperforms existing approaches.

Abstract:
Neural Radiance Fields (NeRF) have been widely adopted for reconstructing high‑quality 3D point clouds from 2D RGB images. However, the segmentation of these reconstructed 3D scenes is more essential for downstream tasks such as object counting, size estimation, and scene understanding. While segmentation on raw 3D point clouds using deep learning requires labor‑intensive and time‑consuming manual annotation, directly training NeRF on binary masks also fails due to the absence of color and shading cues essential for geometry learning. We propose Invariant NeRF for Segmentation (InvNeRFSeg), a two‑step, zero‑change fine‑tuning strategy for 3D segmentation. We first train a standard NeRF on RGB images and then fine‑tune it using 2D segmentation masks without altering either the model architecture or loss function. This approach produces higher‑quality, cleaner segmented point clouds directly from the refined radiance field with minimal computational overhead or complexity. Field density analysis reveals consistent semantic refinement: densities of object regions increase while background densities are suppressed, ensuring clean and interpretable segmentations. We demonstrate InvNeRFSeg's superior performance over both SA3D and FruitNeRF on both synthetic fruit and real‑world soybean datasets. This approach effectively extends 2D segmentation to high‑quality 3D segmentation.

Abstract:
This paper investigates the use of large‑scale diffusion models for Zero‑Shot Video Object Segmentation (ZS‑VOS) without fine‑tuning on video data or training on any image segmentation data. While diffusion models have demonstrated strong visual representations across various tasks, their direct application to ZS‑VOS remains underexplored. Our goal is to find the optimal feature extraction process for ZS‑VOS by identifying the most suitable time step and layer from which to extract features. We further analyze the affinity of these features and observe a strong correlation with point correspondences. Through extensive experiments on DAVIS‑17 and MOSE, we find that diffusion models trained on ImageNet outperform those trained on larger, more diverse datasets for ZS‑VOS. Additionally, we highlight the importance of point correspondences in achieving high segmentation accuracy, and we yield state‑of‑the‑art results in ZS‑VOS. Finally, our approach performs on par with models trained on expensive image segmentation datasets.

Abstract:
Semi‑supervised instance segmentation poses challenges due to limited labeled data, causing difficulties in accurately localizing distinct object instances. Current teacher‑student frameworks still suffer from performance constraints due to unreliable pseudo‑label quality stemming from limited labeled data. While the Segment Anything Model (SAM) offers robust segmentation capabilities at various granularities, directly applying SAM to this task introduces challenges such as class‑agnostic predictions and potential over‑segmentation. To address these complexities, we carefully integrate SAM into the semi‑supervised instance segmentation framework, developing a novel distillation method that effectively captures the precise localization capabilities of SAM without compromising semantic recognition. Furthermore, we incorporate pseudo‑label refinement as well as a specialized data augmentation with the refined pseudo‑labels, resulting in superior performance. We establish state‑of‑the‑art performance, and provide comprehensive experiments and ablation studies to validate the effectiveness of our proposed approach.

Abstract:
Motion expression video segmentation is designed to segment objects in accordance with the input motion expressions. In contrast to the conventional Referring Video Object Segmentation (RVOS), it places emphasis on motion as well as multi‑object expressions, making it more arduous. Recently, Large Multimodal Models (LMMs) have begun to shine in RVOS due to their powerful vision‑language perception capabilities. In this work, we propose a simple and effective inference optimization method to fully unleash the potential of LMMs in referring video segmentation. Firstly, we use Sa2VA as our baseline, which is a unified LMM for dense grounded understanding of both images and videos. Secondly, we uniformly sample the video frames during the inference process to enhance the model's understanding of the entire video. Finally, we integrate the results of multiple expert models to mitigate the erroneous predictions of a single model. Our solution achieved 61.98% J&F on the MeViS test set and ranked 1st place in the 4th PVUW Challenge MeViS Track at CVPR 2025.
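
The uniform frame sampling step can be sketched as follows. This is a minimal illustration that picks the centre frame of equal‑length segments; the function name and the segment‑centre rule are assumptions, and the actual sampling used with Sa2VA may differ.

```python
def uniform_frame_indices(num_frames, num_samples):
    """Pick frame indices spread evenly over the whole video."""
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    if num_samples >= num_frames:
        # Short video: use every frame once.
        return list(range(num_frames))
    # Split the video into equal segments and take each segment's centre.
    step = num_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


# Hypothetical 100-frame clip sampled at 4 frames.
print(uniform_frame_indices(100, 4))
```

Sampling across the full clip, rather than taking only the first frames, gives the model evidence about motion that unfolds late in the video, which matters for expressions that describe motion.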

Abstract:
Box‑supervised instance segmentation methods aim to achieve instance segmentation with only box annotations. Recent methods have demonstrated the effectiveness of acquiring high‑quality pseudo masks under the teacher‑student framework. Building upon this foundation, we propose a BoxSeg framework involving two novel and general modules named the Quality‑Aware Module (QAM) and the Peer‑assisted Copy‑paste (PC). The QAM obtains high‑quality pseudo masks and better measures the mask quality to help reduce the effect of noisy masks, by leveraging the quality‑aware multi‑mask complementation mechanism. The PC imitates Peer‑Assisted Learning to further improve the quality of the low‑quality masks with the guidance of the obtained high‑quality pseudo masks. Theoretical and experimental analyses demonstrate the proposed QAM and PC are effective. Extensive experimental results show the superiority of our BoxSeg over the state‑of‑the‑art methods, and illustrate the QAM and PC can be applied to improve other models.

Abstract:
Machine learning‑based embedded systems for safety‑critical applications, such as aerospace and autonomous driving, must be robust to perturbations caused by soft errors. As transistor geometries shrink and voltages decrease, modern electronic devices become more susceptible to background radiation, increasing the concern about failures produced by soft errors. The resilience of deep neural networks (DNNs) to these errors depends not only on target device technology but also on model structure and the numerical representation and arithmetic precision of their parameters. Compression techniques like pruning and quantization, used to reduce memory footprint and computational complexity, alter both model structure and representation, affecting soft error robustness. In this regard, although often overlooked, the choice of activation functions (AFs) impacts not only accuracy and trainability but also compressibility and error resilience. This paper explores the use of bounded AFs to enhance robustness against parameter perturbations, while evaluating their effects on model accuracy, compressibility, and computational load with a technology‑agnostic approach. We focus on encoder‑decoder convolutional models developed for semantic segmentation of hyperspectral images with application to autonomous driving systems. Experiments are conducted on an AMD‑Xilinx KV260 SoM.

Abstract:
In panoptic segmentation, individual instances must be separated within semantic classes. As state‑of‑the‑art methods rely on a pre‑defined set of classes, they struggle with novel categories and out‑of‑distribution (OOD) data. This is particularly problematic in safety‑critical applications, such as autonomous driving, where reliability in unseen scenarios is essential. We address the gap between outstanding benchmark performance and reliability by proposing Prior2Former (P2F), the first approach for segmentation vision transformers rooted in evidential learning. P2F extends the mask vision transformer architecture by incorporating a Beta prior for computing model uncertainty in pixel‑wise binary mask assignments. This design enables high‑quality uncertainty estimation that effectively detects novel and OOD objects enabling state‑of‑the‑art anomaly instance segmentation and open‑world panoptic segmentation. Unlike most segmentation models addressing unknown classes, P2F operates without access to OOD data samples or contrastive training on void (i.e., unlabeled) classes, making it highly applicable in real‑world scenarios where such prior information is unavailable. Additionally, P2F can be flexibly applied to anomaly instance and panoptic segmentation. Through comprehensive experiments on the Cityscapes, COCO, SegmentMeIfYouCan, and OoDIS datasets, P2F demonstrates state‑of‑the‑art performance across the board.

Abstract:
Precise video retrieval requires multi‑modal correlations to handle unseen vocabulary and scenes, becoming more complex for lengthy videos where models must perform effectively without prior training on a specific dataset. We introduce a unified framework that combines a visual matching stream and an aural matching stream with a unique subtitles‑based video segmentation approach. Additionally, the aural stream includes a complementary audio‑based two‑stage retrieval mechanism that enhances performance on long‑duration videos. Considering the complex nature of retrieval from lengthy videos and its corresponding evaluation, we introduce a new retrieval evaluation method specifically designed for long‑video retrieval to support further research. We conducted experiments on the YouCook2 benchmark, showing promising retrieval performance.

Abstract:
Instance segmentation plays a pivotal role in medical image analysis by enabling precise localization and delineation of lesions, tumors, and anatomical structures. Although deep learning models such as Mask R‑CNN and BlendMask have achieved remarkable progress, their application in high‑risk medical scenarios remains constrained by confidence calibration issues, which may lead to misdiagnosis. To address this challenge, we propose a robust quality control framework based on conformal prediction theory. This framework innovatively constructs a risk‑aware dynamic threshold mechanism that adaptively adjusts segmentation decision boundaries according to clinical requirements. Specifically, we design a calibration‑aware loss function that dynamically tunes the segmentation threshold based on a user‑defined risk level α. Utilizing exchangeable calibration data, this method ensures that the expected FNR or FDR on test data remains below α with high probability. The framework maintains compatibility with mainstream segmentation models (e.g., Mask R‑CNN, BlendMask+ResNet‑50‑FPN) and datasets (PASCAL VOC format) without requiring architectural modifications. Empirical results demonstrate that we rigorously bound the FDR metric marginally over the test set via our developed calibration framework.
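
A risk‑controlling threshold of this kind can be sketched in the style of conformal risk control, here for the FNR case. The helper names, the candidate grid, and the toy calibration data are hypothetical; the paper's calibration‑aware loss is not reproduced. The sketch assumes pixel scores in [0, 1], binary labels, and a loss bounded by 1, and it relies on the FNR being non‑decreasing in the threshold.

```python
def false_negative_rate(probs, labels, threshold):
    """Fraction of true foreground pixels scored below the threshold."""
    positives = [p for p, y in zip(probs, labels) if y == 1]
    if not positives:
        return 0.0
    missed = sum(1 for p in positives if p < threshold)
    return missed / len(positives)


def calibrate_threshold(calibration_set, alpha, candidates):
    """Largest candidate threshold whose corrected empirical FNR is <= alpha.

    The (n * risk + 1) / (n + 1) correction is the finite-sample term used
    in conformal risk control for losses bounded by 1.
    """
    n = len(calibration_set)
    best = None
    for t in sorted(candidates):
        mean_fnr = sum(false_negative_rate(p, y, t) for p, y in calibration_set) / n
        if (n * mean_fnr + 1.0) / (n + 1) <= alpha:
            best = t
        else:
            break  # FNR only grows as the threshold rises
    return best


# Hypothetical calibration images, flattened to (scores, labels) pairs.
calibration = [
    ([0.90, 0.80, 0.20, 0.10], [1, 1, 0, 0]),
    ([0.70, 0.60, 0.30, 0.40], [1, 1, 0, 0]),
    ([0.95, 0.50, 0.10, 0.20], [1, 1, 0, 0]),
    ([0.85, 0.75, 0.05, 0.30], [1, 1, 0, 0]),
]
grid = [i / 10 for i in range(1, 10)]
print(calibrate_threshold(calibration, alpha=0.35, candidates=grid))
```

With only four calibration images the correction term alone is 0.2, so no threshold can satisfy a risk level below that; a realistic calibration set would need to be much larger to support small α.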

Abstract:
Model compression is a critical area of research in deep learning, in particular in vision, driven by the need to lighten models' memory or computational footprints. While numerous methods for model compression have been proposed, most focus on pruning, quantization, or knowledge distillation. In this work, we delve into an under‑explored avenue: reducing the resolution of the input image as a complementary approach to other types of compression. By systematically investigating the impact of input resolution reduction on both classification and semantic segmentation tasks, and on both convnets and transformer‑based architectures, we demonstrate that this strategy provides an interesting alternative for model compression. Our experimental results on standard benchmarks highlight the potential of this method, achieving competitive performance while significantly reducing computational and memory requirements. This study establishes input resolution reduction as a viable and promising direction in the broader landscape of model compression techniques for vision applications.

Abstract:
Aquatic bodies face numerous environmental threats caused by several marine anomalies. Marine debris can devastate habitats and endanger marine life through entanglement, while harmful algal blooms can produce toxins that negatively affect marine ecosystems. Additionally, ships may discharge oil or engage in illegal and overfishing activities, causing further harm. These marine anomalies can be identified by applying trained deep learning (DL) models on multispectral satellite imagery. Furthermore, the detection of other anomalies, such as clouds, could be beneficial in filtering out irrelevant images. However, DL models often require a large volume of labeled data for training, which can be both costly and time‑consuming, particularly for marine anomaly detection where expert annotation is needed. A potential solution is the use of semi‑supervised learning methods, which can also utilize unlabeled data. In this project, we implement and study the performance of FixMatch for Semantic Segmentation, a semi‑supervised algorithm for semantic segmentation. Firstly, we found that semi‑supervised models perform best with a high confidence threshold of 0.9 when there is a limited amount of labeled data. Secondly, we compare the performance of semi‑supervised models with fully‑supervised models under varying amounts of labeled data. Our findings suggest that semi‑supervised models outperform fully‑supervised models with limited labeled data, while fully‑supervised models have a slightly better performance with larger volumes of labeled data. We propose two hypotheses to explain why fully‑supervised models surpass semi‑supervised ones when a high volume of labeled data is used. All of our experiments were conducted using a U‑Net model architecture with a limited number of parameters to ensure compatibility with space‑rated hardware.
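
The role of the 0.9 confidence threshold can be illustrated with a minimal sketch of FixMatch‑style pseudo‑labelling at the pixel level. The function name, the ignore value, and the sample probabilities are hypothetical; the sketch only shows how low‑confidence pixels are excluded from the unsupervised loss.

```python
def pseudo_labels(weak_probs, threshold=0.9, ignore_index=-1):
    """Turn per-pixel class probabilities into pseudo-labels.

    `weak_probs` holds one probability vector per pixel, predicted on the
    weakly augmented view. Pixels whose top probability is below the
    threshold receive `ignore_index` and would be masked out of the loss
    computed on the strongly augmented view.
    """
    labels = []
    for pixel_probs in weak_probs:
        confidence = max(pixel_probs)
        if confidence >= threshold:
            labels.append(pixel_probs.index(confidence))
        else:
            labels.append(ignore_index)
    return labels


# Hypothetical predictions for three pixels over three classes.
predictions = [
    [0.95, 0.03, 0.02],  # confident: class 0
    [0.50, 0.40, 0.10],  # uncertain: ignored
    [0.05, 0.92, 0.03],  # confident: class 1
]
print(pseudo_labels(predictions))
```

A high threshold trades coverage for label quality: fewer unlabeled pixels contribute to training, but those that do are less likely to carry wrong labels, which matters most when labeled data is scarce.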

Abstract:
The rapid increase in remote sensing satellites has led to the emergence of distributed space‑based observation systems. However, existing distributed remote sensing models often rely on centralized training, resulting in data leakage, communication overhead, and reduced accuracy due to data distribution discrepancies across platforms. To address these challenges, we propose the Self‑Adjustment FEderated Learning (SAFE) framework, which innovatively leverages federated learning to enhance collaborative sensing in remote sensing scenarios. SAFE introduces four key strategies: (1) Class Rectification Optimization, which autonomously addresses class imbalance under unknown local and global distributions. (2) Feature Alignment Update, which mitigates Non‑IID data issues via locally controlled EMA updates. (3) Dual‑Factor Modulation Rheostat, which dynamically balances optimization effects during training. (4) Adaptive Context Enhancement, which is designed to improve model performance by dynamically refining foreground regions, ensuring computational efficiency with accuracy improvement across distributed satellites. Experiments on real‑world image classification and object segmentation datasets validate the effectiveness and reliability of the SAFE framework in complex remote sensing scenarios.

Abstract:
Medical image and video segmentation is a critical task for precision medicine, which has witnessed considerable progress in developing task or modality‑specific and generalist models for 2D images. However, there have been limited studies on building general‑purpose models for 3D images and videos with comprehensive user studies. Here, we present MedSAM2, a promptable segmentation foundation model for 3D image and video segmentation. The model is developed by fine‑tuning the Segment Anything Model 2 on a large medical dataset with over 455,000 3D image‑mask pairs and 76,000 frames, outperforming previous models across a wide range of organs, lesions, and imaging modalities. Furthermore, we implement a human‑in‑the‑loop pipeline to facilitate the creation of large‑scale datasets resulting in, to the best of our knowledge, the most extensive user study to date, involving the annotation of 5,000 CT lesions, 3,984 liver MRI lesions, and 251,550 echocardiogram video frames, demonstrating that MedSAM2 can reduce manual costs by more than 85%. MedSAM2 is also integrated into widely used platforms with user‑friendly interfaces for local and cloud deployment, making it a practical tool for supporting efficient, scalable, and high‑quality segmentation in both research and healthcare environments.

Abstract:
Grasping is a fundamental skill for interacting with the environment. However, this ability can be difficult for some (e.g. due to disability). Wearable robotic solutions can enhance or restore hand function, and recent advances have leveraged computer vision to improve grasping capabilities. However, grasping transparent objects remains challenging due to their poor visual contrast and ambiguous depth cues. Furthermore, while multimodal control strategies incorporating tactile and auditory feedback have been explored to grasp transparent objects, the integration of vision with these modalities remains underdeveloped. This paper introduces MultiClear, a multimodal framework designed to enhance grasping assistance in a wearable soft exoskeleton glove for transparent objects by fusing RGB data, depth data, and auditory signals. The exoskeleton glove integrates a tendon‑driven actuator with an RGB‑D camera and a built‑in microphone. To achieve precise and adaptive control, a hierarchical control architecture is proposed. For the proposed hierarchical control architecture, a high‑level control layer provides contextual awareness, a mid‑level control layer processes multimodal sensory inputs, and a low‑level control executes PID motor control for fine‑tuned grasping adjustments. The challenge of transparent object segmentation was managed by introducing a vision foundation model for zero‑shot segmentation. The proposed system achieves a Grasping Ability Score of 70.37%, demonstrating its effectiveness in transparent object manipulation.

Abstract:
Developing computer vision‑based rice phenotyping techniques is crucial for precision field management and accelerating breeding, thereby continuously advancing rice production. Among phenotyping tasks, distinguishing image components is a key prerequisite for characterizing plant growth and development at the organ scale, enabling deeper insights into eco‑physiological processes. However, due to the fine structure of rice organs and complex illumination within the canopy, this task remains highly challenging, underscoring the need for a high‑quality training dataset. Such datasets are scarce, both due to a lack of large, representative collections of rice field images and the time‑intensive nature of annotation. To address this gap, we established the first comprehensive multi‑class rice semantic segmentation dataset, RiceSEG. We gathered nearly 50,000 high‑resolution, ground‑based images from five major rice‑growing countries (China, Japan, India, the Philippines, and Tanzania), encompassing over 6,000 genotypes across all growth stages. From these original images, 3,078 representative samples were selected and annotated with six classes (background, green vegetation, senescent vegetation, panicle, weeds, and duckweed) to form the RiceSEG dataset. Notably, the sub‑dataset from China spans all major genotypes and rice‑growing environments from the northeast to the south. Both state‑of‑the‑art convolutional neural networks and transformer‑based semantic segmentation models were used as baselines. While these models perform reasonably well in segmenting background and green vegetation, they face difficulties during the reproductive stage, when canopy structures are more complex and multiple classes are involved. These findings highlight the importance of our dataset for developing specialized segmentation models for rice and other crops.

Abstract:
Forest stands are the fundamental units in forest management inventories, silviculture, and financial analysis within operational forestry. Over the past two decades, a common method for mapping stand borders has involved delineation through manual interpretation of stereographic aerial images. This is a time‑consuming and subjective process, limiting operational efficiency and introducing inconsistencies. Substantial effort has been devoted to automating the process, using various algorithms together with aerial images and canopy height models constructed from airborne laser scanning (ALS) data, but manual interpretation remains the preferred method. Deep learning (DL) methods have demonstrated great potential in computer vision, yet their application to forest stand delineation remains unexplored in published research. This study presents a novel approach, framing stand delineation as a multiclass segmentation problem and applying a U‑Net‑based DL framework. The model was trained and evaluated using multispectral images, ALS data, and an existing stand map created by an expert interpreter. Performance was assessed on independent data using overall accuracy, a standard metric for classification tasks that measures the proportion of correctly classified pixels. The model achieved an overall accuracy of 0.73. These results demonstrate strong potential for DL in automated stand delineation. However, a few key challenges were noted, especially for complex forest environments.
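
The overall accuracy metric described above can be sketched in a few lines. The function name and the toy 2x3 class maps below are hypothetical and only illustrate the definition (correctly classified pixels divided by all pixels).

```python
from typing import Sequence


def overall_accuracy(pred: Sequence[Sequence[int]],
                     truth: Sequence[Sequence[int]]) -> float:
    """Fraction of pixels whose predicted class equals the reference class."""
    if len(pred) != len(truth):
        raise ValueError("maps must have the same number of rows")
    correct = 0
    total = 0
    for pred_row, truth_row in zip(pred, truth):
        if len(pred_row) != len(truth_row):
            raise ValueError("rows must have the same length")
        for p, t in zip(pred_row, truth_row):
            correct += int(p == t)
            total += 1
    if total == 0:
        raise ValueError("maps must not be empty")
    return correct / total


# Toy stand maps with three classes (0, 1, 2); 4 of the 6 pixels agree.
predicted = [[0, 0, 1],
             [2, 2, 1]]
reference = [[0, 1, 1],
             [2, 2, 2]]
accuracy = overall_accuracy(predicted, reference)
```

Because overall accuracy pools all pixels, large classes dominate the score, which is one reason per‑class metrics such as IoU are often reported alongside it.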

Abstract:
Nuclear instance segmentation and classification provide critical quantitative foundations for digital pathology diagnosis. With the advent of the foundational Segment Anything Model (SAM), the accuracy and efficiency of nuclear segmentation have improved significantly. However, SAM imposes a strong reliance on precise prompts, and its class‑agnostic design renders its classification results entirely dependent on the provided prompts. Therefore, we focus on generating prompts with more accurate localization and classification and propose APSeg, an Auto‑Prompt model with acquired and injected knowledge for nuclear instance Segmentation and classification. APSeg incorporates two knowledge‑aware modules: (1) Distribution‑Guided Proposal Offset Module (DG‑POM), which learns distribution knowledge through density‑map guidance, and (2) Category Knowledge Semantic Injection Module (CK‑SIM), which injects morphological knowledge derived from category descriptions. We conducted extensive experiments on the PanNuke and CoNSeP datasets, demonstrating the effectiveness of our approach. The code will be released upon acceptance.

Abstract:
3D point cloud semantic segmentation (PCSS) is a cornerstone for environmental perception in robotic systems and autonomous driving, enabling precise scene understanding through point‑wise classification. While unsupervised domain adaptation (UDA) mitigates label scarcity in PCSS, existing methods critically overlook the inherent vulnerability to real‑world perturbations (e.g., snow, fog, rain) and adversarial distortions. This work first identifies two intrinsic limitations that undermine current PCSS‑UDA robustness: (a) unsupervised feature overlap arising from unaligned boundaries in shared‑class regions and (b) feature structure erosion caused by domain‑invariant learning that suppresses target‑specific patterns. To address these problems, we propose a tripartite framework consisting of: 1) a robustness evaluation model quantifying resilience against adversarial attack/corruption types through robustness metrics; 2) an invertible attention alignment module (IAAM) enabling bidirectional domain mapping while preserving discriminative structure via attention‑guided overlap suppression; and 3) a contrastive memory bank with quality‑aware contrastive learning that progressively refines pseudo‑labels with feature quality for more discriminative representations. Extensive experiments on SynLiDAR‑to‑SemanticPOSS adaptation demonstrate a maximum mIoU improvement of 14.3% under adversarial attack.

Abstract:
Unsupervised domain adaptation (UDA) frameworks have shown good generalization capabilities for 3D point cloud semantic segmentation models on clean data. However, existing works overlook adversarial robustness when the source domain itself is compromised. To comprehensively explore the robustness of the UDA frameworks, we first design a stealthy adversarial point cloud generation attack that can significantly contaminate datasets with only minor perturbations to the point cloud surface. Based on that, we propose a novel dataset, AdvSynLiDAR, comprising synthesized contaminated LiDAR point clouds. With the generated corrupted data, we further develop the Adversarial Adaptation Framework (AAF) as the countermeasure. Specifically, by extending the key point sensitive (KPS) loss towards the Robust Long‑Tail loss (RLT loss) and utilizing a decoder branch, our approach enables the model to focus on long‑tail classes during the pre‑training phase and leverages high‑confidence decoded point cloud information to restore point cloud structures during the adaptation phase. We evaluated our AAF method on the AdvSynLiDAR dataset, where the results demonstrate that our AAF method can mitigate performance degradation under source adversarial perturbations for UDA in the 3D point cloud segmentation application.

Abstract:
3D point cloud semantic segmentation technology has been widely used. However, in real‑world scenarios, the environment is evolving. Thus, offline‑trained segmentation models may lead to catastrophic forgetting of previously seen classes. Class‑incremental learning (CIL) is designed to address the problem of catastrophic forgetting. While point clouds are common, we observe high similarity and unclear boundaries between different classes. Meanwhile, they are known to be imbalanced in class distribution. These lead to issues including misclassification between similar classes and the long‑tail problem, which have not been adequately addressed in previous CIL methods. We thus propose ProtoGuard and PROPEL (Progressive Refinement Of PsEudo‑Labels). In the base‑class training phase, ProtoGuard maintains geometric and semantic prototypes for each class, which are combined into prototype features using an attention mechanism. In the novel‑class training phase, PROPEL inherits the base feature extractor and classifier, guiding pseudo‑label propagation and updates based on density distribution and semantic similarity. Extensive experiments show that our approach achieves remarkable results on both the S3DIS and ScanNet datasets, improving the mIoU of 3D point cloud segmentation by a maximum of 20.39% under the 5‑step CIL scenario on S3DIS.

Abstract:
The robustness of deep neural networks is a crucial factor in safety‑critical applications, particularly in complex and dynamic environments (e.g., medical or driving scenarios) where localized corruptions can arise. While previous studies have evaluated the robustness of semantic segmentation (SS) models under whole‑image natural or adversarial corruptions, the spatial robustness of dense vision models under localized corruptions remains underexplored. This paper fills this gap by introducing novel, region‑aware metrics for benchmarking the spatial robustness of segmentation models, along with an evaluation framework to assess the impact of natural localized corruptions. Furthermore, it uncovers the inherent complexity of evaluating worst‑case spatial robustness using only a single localized adversarial attack. To address this, the work proposes a region‑aware multi‑attack adversarial analysis to systematically assess model robustness across specific image regions. The proposed metrics and analysis were used to evaluate 14 segmentation models in driving scenarios, uncovering key insights into the effects of localized corruption in both natural and adversarial forms. The results reveal that models respond to these two types of threats differently; for instance, transformer‑based segmentation models demonstrate notable robustness to localized natural corruptions but are highly vulnerable to adversarial ones, and vice versa for CNN‑based models. Consequently, we also address the challenge of balancing robustness to both natural and adversarial localized corruptions by means of ensemble models, thereby achieving a broader threat coverage and improved reliability for dense vision tasks.

Abstract:
Nuclear instance segmentation plays a vital role in disease diagnosis within digital pathology. However, limited labeled data in pathological images restricts the overall performance of nuclear instance segmentation. To tackle this challenge, we propose a novel data augmentation framework, the Instance Migration Diffusion Model (IM‑Diffusion), designed to generate more varied pathological images by constructing diverse nuclear layouts and internuclear spatial relationships. In detail, we introduce a Nuclear Migration Module (NMM) which constructs diverse nuclear layouts by simulating the process of nuclear migration. Building on this, we further present an Internuclear‑regions Inpainting Module (IIM) to generate diverse internuclear spatial relationships by structure‑aware inpainting. On the basis of the above, IM‑Diffusion generates more diverse pathological images with different layouts and internuclear spatial relationships, thereby facilitating downstream tasks. Evaluations on the CoNSeP and GLySAC datasets demonstrate that the images generated by IM‑Diffusion effectively enhance overall instance segmentation performance. Code will be made public later.

Abstract:
LiDAR‑based 3D point cloud recognition has been proven beneficial in various applications. However, the sparsity and varying density pose a significant challenge in capturing intricate details of objects, particularly for medium‑range and small targets. Therefore, we propose a multi‑modal point cloud semantic segmentation method based on Virtual Point Enhancement (VPE), which integrates virtual points generated from images to address these issues. These virtual points are dense but noisy, and directly incorporating them can increase computational burden and degrade performance. To this end, we introduce a spatial difference‑driven adaptive filtering module that selectively extracts valuable pseudo points from these virtual points based on density and distance, enhancing the density of medium‑range targets. Subsequently, we propose a noise‑robust sparse feature encoder that incorporates noise‑robust feature extraction and fine‑grained feature enhancement. Noise‑robust feature extraction exploits the 2D image space to reduce the impact of noisy points, while fine‑grained feature enhancement boosts sparse geometric features through inner‑voxel neighborhood point aggregation and downsampled voxel aggregation. Results on SemanticKITTI and nuScenes, two large‑scale benchmark datasets, validate the effectiveness of the method, which improves mIoU by 2.89% on nuScenes with the introduction of 7.7% virtual points.

Abstract:
Rip currents are strong, localized and narrow currents of water that flow outwards into the sea, causing numerous beach‑related injuries and fatalities worldwide. Accurate identification of rip currents remains challenging due to their amorphous nature and the lack of annotated data, which often requires expert knowledge. To address these issues, we present RipVIS, a large‑scale video instance segmentation benchmark explicitly designed for rip current segmentation. RipVIS is an order of magnitude larger than previous datasets, featuring 184 videos (212,328 frames), of which 150 videos (163,528 frames) are with rip currents, collected from various sources, including drones, mobile phones, and fixed beach cameras. Our dataset encompasses diverse visual contexts, such as wave‑breaking patterns, sediment flows, and water color variations, across multiple global locations, including USA, Mexico, Costa Rica, Portugal, Italy, Greece, Romania, Sri Lanka, Australia and New Zealand. Most videos are annotated at 5 FPS to ensure accuracy in dynamic scenarios, supplemented by an additional 34 videos (48,800 frames) without rip currents. We conduct comprehensive experiments with Mask R‑CNN, Cascade Mask R‑CNN, SparseInst and YOLO11, fine‑tuning these models for the task of rip current segmentation. Results are reported in terms of multiple metrics, with a particular focus on the F_2 score to prioritize recall and reduce false negatives. To enhance segmentation performance, we introduce a novel post‑processing step based on Temporal Confidence Aggregation (TCA). RipVIS aims to set a new standard for rip current segmentation, contributing towards safer beach environments. We offer a benchmark website to share data, models, and results with the research community, encouraging ongoing collaboration and future contributions, at https://ripvis.ai.
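
To make the role of the F_2 score concrete, here is a small sketch of the general F‑beta formula computed from true positive, false positive, and false negative counts. The function name and the example counts are hypothetical; only the formula itself is standard.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F-beta score from counts; beta > 1 weights recall above precision.

    F_beta = (1 + beta^2) * TP / ((1 + beta^2) * TP + beta^2 * FN + FP)
    """
    b2 = beta * beta
    denominator = (1.0 + b2) * tp + b2 * fn + fp
    if denominator == 0:
        return 0.0  # no positives predicted or present; score defined as 0 here
    return (1.0 + b2) * tp / denominator


# Two hypothetical detectors with the same number of true positives:
# one raises 4 false alarms, the other misses 4 rip currents.
false_alarm_f2 = f_beta(tp=8, fp=4, fn=0)
missed_f2 = f_beta(tp=8, fp=0, fn=4)
false_alarm_f1 = f_beta(tp=8, fp=4, fn=0, beta=1.0)
missed_f1 = f_beta(tp=8, fp=0, fn=4, beta=1.0)
```

With beta = 1 the two detectors score the same, while with beta = 2 the detector that misses rip currents scores lower, which matches the stated goal of reducing false negatives.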

Abstract:
We present SuperDec, an approach for creating compact 3D scene representations via decomposition into superquadric primitives. While most recent works leverage geometric primitives to obtain photorealistic 3D scene representations, we propose to leverage them to obtain a compact yet expressive representation. We propose to solve the problem locally on individual objects and leverage the capabilities of instance segmentation methods to scale our solution to full 3D scenes. In doing so, we design a new architecture that efficiently decomposes point clouds of arbitrary objects into a compact set of superquadrics. We train our architecture on ShapeNet and demonstrate its generalization capabilities on object instances extracted from the ScanNet++ dataset as well as on full Replica scenes. Finally, we show how a compact representation based on superquadrics can be useful for a diverse range of downstream applications, including robotic tasks and controllable visual content generation and editing.

Abstract:
Zero‑shot 4D segmentation and recognition of arbitrary objects in Lidar is crucial for embodied navigation, with applications ranging from streaming perception to semantic mapping and localization. However, the primary challenge in advancing research and developing generalized, versatile methods for spatio‑temporal scene understanding in Lidar lies in the scarcity of datasets that provide the necessary diversity and scale of annotations. To overcome these challenges, we propose SAL‑4D (Segment Anything in Lidar‑‑4D), a method that utilizes multi‑modal robotic sensor setups as a bridge to distill recent developments in Video Object Segmentation (VOS) in conjunction with off‑the‑shelf Vision‑Language foundation models to Lidar. We utilize VOS models to pseudo‑label tracklets in short video sequences, annotate these tracklets with sequence‑level CLIP tokens, and lift them to the 4D Lidar space using calibrated multi‑modal sensory setups to distill them to our SAL‑4D model. Owing to temporally consistent predictions, we outperform prior art in 3D Zero‑Shot Lidar Panoptic Segmentation (LPS) by over 5 PQ, and unlock Zero‑Shot 4D‑LPS.

Abstract:
Promoting the connectivity of curvilinear structures, such as neuronal processes in biomedical scans and blood vessels in CT images, remains a key challenge in semantic segmentation. Traditional pixel‑wise loss functions, including cross‑entropy and Dice losses, often fail to capture high‑level topological connectivity, resulting in topological mistakes in graphs obtained from prediction maps. In this paper, we propose CAPE (Connectivity‑Aware Path Enforcement), a novel loss function designed to enforce connectivity in graphs obtained from segmentation maps by optimizing a graph connectivity metric. CAPE uses the graph representation of the ground truth to select node pairs and determine their corresponding paths within the predicted segmentation through a shortest‑path algorithm. Using this, we penalize both disconnections and false positive connections, effectively promoting the model to preserve topological correctness. Experiments on 2D and 3D datasets, including neuron and blood vessel tracing, demonstrate that CAPE significantly improves topology‑aware metrics and outperforms state‑of‑the‑art methods.
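
The shortest‑path idea can be illustrated with a small sketch: given a predicted probability map and two nodes that are connected in the ground truth, the cheapest path between them becomes expensive when the prediction contains a gap. This is not the CAPE loss itself, whose formulation the abstract does not give; the per‑pixel cost of `1 - probability`, the 4‑connected grid, and the function name are assumptions chosen for illustration.

```python
import heapq
from typing import Dict, List, Tuple

Grid = List[List[float]]
Node = Tuple[int, int]


def path_cost(prob: Grid, start: Node, goal: Node) -> float:
    """Dijkstra on a 4-connected grid; entering a pixel costs 1 - probability."""
    rows, cols = len(prob), len(prob[0])
    best: Dict[Node, float] = {start: 0.0}
    heap: List[Tuple[float, Node]] = [(0.0, start)]
    while heap:
        cost, (r, c) = heapq.heappop(heap)
        if (r, c) == goal:
            return cost
        if cost > best.get((r, c), float("inf")):
            continue  # stale heap entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                new_cost = cost + (1.0 - prob[nr][nc])
                if new_cost < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = new_cost
                    heapq.heappush(heap, (new_cost, (nr, nc)))
    return float("inf")


# A horizontal "vessel" along the middle row, predicted with and without a gap.
connected = [[0.0, 0.0, 0.0],
             [0.9, 0.9, 0.9],
             [0.0, 0.0, 0.0]]
broken = [[0.0, 0.0, 0.0],
          [0.9, 0.1, 0.9],
          [0.0, 0.0, 0.0]]
cost_connected = path_cost(connected, (1, 0), (1, 2))
cost_broken = path_cost(broken, (1, 0), (1, 2))
```

In this toy case the broken prediction yields a higher path cost than the connected one, so a penalty based on such costs would discourage the disconnection.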

Abstract:
The Segment Anything Model 2 (SAM2), a prompt‑guided video foundation model, has performed remarkably well in video object segmentation, drawing significant attention in the community. Due to the high similarity between camouflaged objects and their surroundings, which makes them difficult to distinguish even by the human eye, the application of SAM2 for automated segmentation in real‑world scenarios faces challenges in camouflage perception and reliable prompt generation. To address these issues, we propose CamoSAM2, a motion‑appearance prompt inducer (MAPI) and refinement framework to automatically generate and refine prompts for SAM2, enabling high‑quality automatic detection and segmentation in the video camouflaged object detection (VCOD) task. Initially, we introduce a prompt inducer that simultaneously integrates motion and appearance cues to detect camouflaged objects, delivering more accurate initial predictions than existing methods. Subsequently, we propose a video‑based adaptive multi‑prompts refinement (AMPR) strategy tailored for SAM2, aimed at mitigating prompt error in initial coarse masks and further producing good prompts. Specifically, we introduce a novel three‑step process to generate reliable prompts by camouflaged object determination, pivotal prompting frame selection, and multi‑prompts formation. Extensive experiments conducted on two benchmark datasets demonstrate that our proposed model, CamoSAM2, significantly outperforms existing state‑of‑the‑art methods, achieving increases of 8.0% and 10.1% in the mIoU metric. Additionally, our method achieves the fastest inference speed compared to current VCOD models.

Abstract:
In recent years, the research community has witnessed growing use of 3D point cloud data owing to its high applicability in various real‑world applications. Point clouds make it possible to account for actual object size and to reason about space. Application fields include the mechanical control of robots, vehicles, and other real‑world systems. Along this line, we aim to improve 3D point cloud instance segmentation, which has emerged as a particularly promising approach for these applications. However, creating 3D point cloud datasets entails enormous costs compared to 2D image datasets. To train a 3D point cloud instance segmentation model, it is necessary not only to assign categories but also to provide detailed annotations for each point in a large‑scale 3D space. Meanwhile, the recent growth in generative models for the 3D domain has spurred proposals for using a generative model to create 3D point cloud data. In this work, we propose pre‑training with 3D synthetic data to train a 3D point cloud instance segmentation model, based on a generative model for 3D scenes represented by point cloud data. We directly generate 3D point cloud data with Point‑E and insert the generated data into a 3D scene. Although other accurate 3D generation models exist as of 2025, even Point‑E, an early 3D generative model, can effectively support pre‑training with 3D synthetic data. In the experimental section, we compare our pre‑training method with baseline methods; the results indicate improved performance, demonstrating the efficacy of 3D generative models for 3D point cloud instance segmentation.

Abstract:
Grounded Conversation Generation (GCG) is an emerging vision‑language task that requires models to generate natural language responses seamlessly intertwined with corresponding object segmentation masks. Recent models, such as GLaMM and OMG‑LLaVA, achieve pixel‑level grounding but incur significant computational costs due to processing a large number of visual tokens. Existing token pruning methods, like FastV and PyramidDrop, fail to preserve the local visual features critical for accurate grounding, leading to substantial performance drops in GCG tasks. To address this, we propose Adaptive Local‑Aware Token Pruning (ALTP), a simple yet effective framework that accelerates GCG models by prioritizing local object information. ALTP introduces two key components: (1) Detail Density Capture (DDC), which uses superpixel segmentation to retain tokens in object‑centric regions, preserving fine‑grained details, and (2) Dynamic Density Formation (DDF), which dynamically allocates tokens based on information density, ensuring higher retention in semantically rich areas. Extensive experiments on the GranDf dataset demonstrate that ALTP significantly outperforms existing token pruning methods, such as FastV and PyramidDrop, on both GLaMM and OMG‑LLaVA models. Notably, when applied to GLaMM, ALTP achieves a 90% reduction in visual tokens with a 4.9% improvement in AP50 and a 5.0% improvement in Recall compared to PyramidDrop. Similarly, on OMG‑LLaVA, ALTP improves AP by 2.1% and mIoU by 3.0% at a 90% token reduction compared with PyramidDrop.

Abstract:
Recent studies have shown that 2D convolution and self‑attention exhibit distinct spectral behaviors, and optimizing their spectral properties can enhance vision model performance. However, theoretical analyses remain limited in explaining why 2D convolution is more effective in high‑pass filtering than self‑attention and why larger kernels favor shape bias, akin to self‑attention. In this paper, we employ graph spectral analysis to theoretically simulate and compare the frequency responses of 2D convolution and self‑attention within a unified framework. Our results corroborate previous empirical findings and reveal that node connectivity, modulated by window size, is a key factor in shaping spectral functions. Leveraging this insight, we introduce a spectral‑adaptive modulation (SPAM) mixer, which processes visual features in a spectral‑adaptive manner using multi‑scale convolutional kernels and a spectral re‑scaling mechanism to refine spectral components. Based on SPAM, we develop SPANetV2 as a novel vision backbone. Extensive experiments demonstrate that SPANetV2 outperforms state‑of‑the‑art models across multiple vision tasks, including ImageNet‑1K classification, COCO object detection, and ADE20K semantic segmentation.

Abstract:
Generalized zero‑shot semantic segmentation (GZS3) aims to achieve the human‑level capability of segmenting not only seen classes but also novel class regions unseen in the training data through introducing the bridge of semantic representations, e.g., word vectors. While effective, the way of utilizing one semantic representation to associate the corresponding class and to enable the knowledge transfer from seen to unseen classes is insufficient as well as incompatible with human cognition. Inspired by the observation that humans often use some 'part' and 'state' information to comprehend the seen objects and imagine unseen classes, we decouple each class into detailed descriptions, including object parts and states. Based on the decoupling formulation, we propose a Decoupled Vision‑Language Matching (DeVLMatch) framework, composed of spatial‑part (SPMatch) and channel‑state (CSMatch) matching modules, for GZS3. In SPMatch, we comprehend objects with spatial part information from both visual and linguistic perspectives and perform graph matching to bridge the gap. In CSMatch, states of objects from the linguistic perspective are matched to compatible channel information from the visual perspective. By decoupling and matching objects across visual and linguistic comprehension, we can explicitly introspect the relationship between seen and unseen classes at the fine‑grained object part and state levels, thereby facilitating the knowledge transfer from seen to unseen classes in visual space. The proposed DeVLMatch framework surpasses the previous GZS3 methods on standard benchmarks, including PASCAL VOC, COCO‑Stuff, and CATARACTS, demonstrating its effectiveness.

Abstract:
In this work, we present DEcoupLEd Distillation To Erase (DELETE), a general and strong unlearning method for any class‑centric task. To derive this, we first propose a theoretical framework to analyze the general form of unlearning loss and decompose it into forgetting and retention terms. Through the theoretical framework, we point out that a class of previous methods could be mainly formulated as a loss that implicitly optimizes the forgetting term while lacking supervision for the retention term, disturbing the distribution of the pre‑trained model and struggling to adequately preserve knowledge of the remaining classes. To address this, we refine the retention term using "dark knowledge" and propose a mask distillation unlearning method. By applying a mask to separate forgetting logits from retention logits, our approach optimizes both the forgetting and refined retention components simultaneously, retaining knowledge of the remaining classes while ensuring thorough forgetting of the target class. Without access to the remaining data or intervention (as used in some prior works), we achieve state‑of‑the‑art performance across various benchmarks. Moreover, DELETE is a general solution that can be applied to various downstream tasks, including face recognition, backdoor defense, and semantic segmentation, with great performance.

Abstract:
Text semantic segmentation involves partitioning a document into multiple paragraphs with continuous semantics based on the subject matter, contextual information, and document structure. Traditional approaches have typically relied on preprocessing documents into segments to address input length constraints, resulting in the loss of critical semantic information across segments. To address this, we present CrossFormer, a transformer‑based model featuring a novel cross‑segment fusion module that dynamically models latent semantic dependencies across document segments, substantially elevating segmentation accuracy. Additionally, CrossFormer can replace rule‑based chunk methods within the Retrieval‑Augmented Generation (RAG) system, producing more semantically coherent chunks that enhance its efficacy. Comprehensive evaluations confirm CrossFormer's state‑of‑the‑art performance on public text semantic segmentation datasets, alongside considerable gains on RAG benchmarks.

Abstract:
Semi‑supervised semantic segmentation (SS‑SS) aims to mitigate the heavy annotation burden of dense pixel labeling by leveraging abundant unlabeled images alongside a small labeled set. While current consistency regularization methods achieve strong results, most do not explicitly model boundaries as a separate learning objective. In this paper, we propose BoundMatch, a novel multi‑task SS‑SS framework that explicitly integrates semantic boundary detection into a teacher‑student consistency regularization pipeline. Our core mechanism, Boundary Consistency Regularized Multi‑Task Learning (BCRM), enforces prediction agreement between teacher and student models on both segmentation masks and detailed semantic boundaries, providing complementary supervision from two independent tasks. To further enhance performance and encourage sharper boundaries, BoundMatch incorporates two lightweight fusion modules: Boundary‑Semantic Fusion (BSF) injects learned boundary cues into the segmentation decoder, while Spatial Gradient Fusion (SGF) refines boundary predictions using mask gradients, yielding more reliable boundary pseudo‑labels. This framework is built upon SAMTH, a strong teacher‑student baseline featuring a Harmonious Batch Normalization (HBN) update strategy for improved stability. Extensive experiments on diverse datasets including Cityscapes and Pascal VOC show that BoundMatch achieves competitive performance against current state‑of‑the‑art methods. Our approach achieves state‑of‑the‑art results on the new Cityscapes benchmark with the DINOv2 foundation model. Ablation studies highlight BoundMatch's ability to improve boundary‑specific evaluation metrics, its effectiveness in realistic large‑scale unlabeled data scenarios, and its applicability to lightweight architectures for mobile deployment.

Abstract:
The global rise in the number of people with physical disabilities, in part due to improvements in post‑trauma survivorship and longevity, has amplified the demand for advanced assistive technologies to improve mobility and independence. Autonomous assistive robots, such as smart wheelchairs, require robust capabilities in spatial segmentation and semantic recognition to navigate complex built environments effectively. Place segmentation involves delineating spatial regions like rooms or functional areas, while semantic recognition assigns semantic labels to these regions, enabling accurate localization to user‑specific needs. Existing approaches often utilize deep learning; however, these closed‑vocabulary detection systems struggle to interpret intuitive and casual human instructions. Additionally, most existing methods ignore the uncertainty of the scene recognition problem, leading to low success rates, particularly in ambiguous and complex environments. To address these challenges, we propose an open‑vocabulary scene semantic segmentation and detection pipeline leveraging Vision Language Models (VLMs) and Large Language Models (LLMs). Our approach follows a 'Segment Detect Select' framework for open‑vocabulary scene classification, enabling adaptive and intuitive navigation for assistive robots in built environments.

Abstract:
Aerial and satellite imagery are inherently complementary remote sensing sources, offering high‑resolution detail alongside expansive spatial coverage. However, the use of these sources for land cover segmentation introduces several challenges, prompting the development of a variety of segmentation methods. Among these approaches, the DeepLabV3+ architecture is considered a promising approach in the field of single‑source image segmentation. However, despite its reliable results for segmentation, there is still a need to increase its robustness and improve its performance. This is particularly crucial for multimodal image segmentation, where the fusion of diverse types of information is essential. An interesting approach involves enhancing this architectural framework through the integration of novel components and the modification of certain internal processes. In this paper, we enhance the DeepLabV3+ architecture by introducing a new block of transposed convolutional layers that upsamples a second input and fuses it with high‑level features. This block is designed to amplify and integrate information from satellite images, thereby enriching the segmentation process through fusion with aerial images. For experiments, we used the LandCover.ai (Land Cover from Aerial Imagery) dataset for aerial images, alongside the corresponding dataset sourced from Sentinel 2 data. Through the fusion of both sources, we achieved a mean Intersection over Union (mIoU) of 84.91% without data augmentation.

Abstract:
Sonar sensing is fundamental for underwater robotics, but it is limited by the capabilities of AI systems, which need large training datasets. Public data in sonar modalities is lacking. This paper presents the Marine Debris Forward‑Looking Sonar datasets, with three different settings (watertank, turntable, flooded quarry) increasing dataset diversity and multiple computer vision tasks: object classification, object detection, semantic segmentation, patch matching, and unsupervised learning. We provide a full dataset description, basic analysis, and initial results for some tasks. We expect the research community will benefit from this dataset, which is publicly available at https://doi.org/10.5281/zenodo.15101686

Abstract:
Purpose: The distribution of visceral adipose tissue (VAT) in cystectomy patients is indicative of the incidence of post‑operative complications. Existing VAT segmentation methods for computed tomography (CT) employing intensity thresholding have limitations relating to inter‑observer variability. Moreover, the difficulty in creating ground‑truth masks limits the development of deep learning (DL) models for this task. This paper introduces a novel method for VAT prediction in pre‑cystectomy CT, which is fully automated and does not require ground‑truth VAT masks for training, overcoming the aforementioned limitations. Methods: We introduce the Kernel density Enhanced VAT Segmentator (KEVS), combining a DL semantic segmentation model, for multi‑body feature prediction, with Gaussian kernel density estimation analysis of predicted subcutaneous adipose tissue to achieve accurate scan‑specific predictions of VAT in the abdominal cavity. Uniquely for a DL pipeline, KEVS does not require ground‑truth VAT masks. Results: We verify the ability of KEVS to accurately segment abdominal organs in unseen CT data and compare KEVS VAT segmentation predictions to existing state‑of‑the‑art (SOTA) approaches in a dataset of 20 pre‑cystectomy CT scans, collected from University College London Hospital (UCLH‑Cyst), with expert ground‑truth annotations. KEVS presents a 4.80% and 6.02% improvement in Dice Coefficient over the second best DL and thresholding‑based VAT segmentation techniques respectively when evaluated on UCLH‑Cyst. Conclusion: This research introduces KEVS, an automated, SOTA method for the prediction of VAT in pre‑cystectomy CT which eliminates inter‑observer variability and is trained entirely on open‑source CT datasets which do not contain ground‑truth VAT masks.
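
For readers unfamiliar with the Dice Coefficient used to report these results, the following is a minimal sketch of the standard definition, Dice = 2|A ∩ B| / (|A| + |B|), on binary masks stored as nested lists. The function name, the toy masks, and the convention for two empty masks are assumptions for illustration and are not taken from the paper.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[Sequence[int]],
                     truth: Sequence[Sequence[int]]) -> float:
    """Dice overlap between two binary masks of identical shape."""
    intersection = 0
    pred_size = 0
    truth_size = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            intersection += 1 if (p and t) else 0
            pred_size += 1 if p else 0
            truth_size += 1 if t else 0
    if pred_size + truth_size == 0:
        return 1.0  # convention chosen here: two empty masks match perfectly
    return 2.0 * intersection / (pred_size + truth_size)


# Toy masks: each has 3 foreground pixels, 2 of which overlap.
predicted_mask = [[1, 1, 0],
                  [0, 1, 0]]
reference_mask = [[1, 0, 0],
                  [0, 1, 1]]
dice = dice_coefficient(predicted_mask, reference_mask)
```

Here the score is 2 * 2 / (3 + 3), so a prediction that captures two thirds of the reference with one spurious pixel scores about 0.67.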

Abstract:
Before deployment in the real world, deep neural networks require thorough evaluation of how they handle both knowns (inputs represented in the training data) and unknowns (anomalies). This is especially important for scene understanding tasks with safety‑critical applications, such as autonomous driving. Existing datasets allow evaluation of only knowns or only unknowns, but not both, which is required to establish the "in the wild" suitability of deep neural network models. To bridge this gap, we propose a novel anomaly segmentation dataset, ISSU, that features a diverse set of anomaly inputs from cluttered real‑world environments. The dataset is twice as large as existing anomaly segmentation datasets, and provides training, validation and test sets for controlled in‑domain evaluation. The test set consists of a static and a temporal part, with the latter comprising videos. The dataset provides annotations for both closed‑set classes (knowns) and anomalies, enabling closed‑set and open‑set evaluation. The dataset covers diverse conditions, such as domain and cross‑sensor shift and illumination variation, and allows ablation of anomaly detection methods with respect to these variations. Evaluation results of current state‑of‑the‑art methods confirm the need for improvements, especially in domain generalization and in small and large object segmentation.

Abstract:
This paper addresses the challenge of data scarcity in semantic segmentation by generating datasets through text‑to‑image (T2I) generation models, reducing image acquisition and labeling costs. Segmentation dataset generation faces two key challenges: 1) aligning generated samples with the target domain and 2) producing informative samples beyond the training data. Fine‑tuning T2I models can help generate samples aligned with the target domain. However, fine‑tuning often overfits and memorizes training data, limiting the model's ability to generate diverse and well‑aligned samples. To overcome these issues, we propose Concept‑Aware LoRA (CA‑LoRA), a novel fine‑tuning approach that selectively identifies and updates only the weights associated with necessary concepts (e.g., style or viewpoint) for domain alignment while preserving the pretrained knowledge of the T2I model to produce informative samples. We demonstrate its effectiveness in generating datasets for urban‑scene segmentation, outperforming baseline and state‑of‑the‑art methods in in‑domain (few‑shot and fully‑supervised) settings, as well as in domain generalization tasks, especially under challenging conditions such as adverse weather and varying illumination.

Abstract:
In this work, we focus on continual semantic segmentation (CSS), where segmentation networks are required to continuously learn new classes without erasing knowledge of previously learned ones. Although storing images of old classes and directly incorporating them into the training of new models has proven effective in mitigating catastrophic forgetting in classification tasks, this strategy presents notable limitations in CSS. Specifically, stored and new images with partial category annotations lead to confusion between unannotated categories and the background, complicating model fitting. To tackle this issue, this paper proposes a novel Enhanced Instance Replay (EIR) method, which not only preserves knowledge of old classes while eliminating background confusion by storing instances of old classes, but also mitigates background shifts in the new images by integrating stored instances with new images. By effectively resolving background shifts in both stored and new images, EIR alleviates catastrophic forgetting in the CSS task, thereby enhancing the model's capacity for CSS. Experimental results validate the efficacy of our approach, which significantly outperforms state‑of‑the‑art CSS methods.

Abstract:
Automated construction is one of the most promising areas that can improve efficiency, reduce costs and minimize errors in the process of building construction. In this paper, a comparative analysis of three neural network models for semantic segmentation, U‑Net(light), LinkNet and PSPNet, is performed. Two specialized datasets with images of houses built from wooden cubes were created for the experiments. The first dataset contains 4 classes (background, foundation, walls, roof) and is designed for basic model evaluation, while the second dataset includes 44 classes where each cube is labeled as a separate object. The models were trained with the same hyperparameters and their accuracy was evaluated using MeanIoU and F1 Score metrics. According to the results obtained, U‑Net(light) showed the best performance with 78% MeanIoU and 87% F1 Score on the first dataset and 17% and 25% respectively on the second dataset. The poor results on the second dataset are due to the limited amount of data, the complexity of the partitioning and the imbalance of classes, making it difficult to accurately segment individual cubes. In addition, overfitting was observed in all experiments, manifested by high accuracy on the training dataset and a significant decrease on the validation dataset. The present work is the basis for the development of algorithms for automatic generation of staged building plans, which can be further scaled to design complete buildings. Future research is planned to extend the datasets and apply methods to combat overfitting (L1/L2 regularization, early stopping). The next stage of work will be the development of algorithms for automatic generation of a step‑by‑step plan for building houses from cubes using manipulators. Index Terms: Deep Learning, Computer vision, CNN, Semantic segmentation, Construction materials.
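
The comparison above is scored with MeanIoU and F1 Score. A minimal Python sketch of how both can be derived from a pixel‑level confusion matrix follows. The function names, the toy matrix, and the choice to score a class with no pixels as 0 are assumptions for illustration (many toolkits skip absent classes instead); this is not the authors' evaluation script.

```python
def per_class_iou_f1(confusion):
    """confusion[i][j] = number of pixels with true class i that were predicted as class j."""
    n = len(confusion)
    ious, f1s = [], []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                         # true class c, predicted otherwise
        fp = sum(confusion[r][c] for r in range(n)) - tp    # predicted c, true class differs
        union = tp + fp + fn
        # Assumed convention: a class that never appears scores 0.
        ious.append(tp / union if union else 0.0)
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return ious, f1s


def mean(values):
    return sum(values) / len(values)


# Toy two-class confusion matrix (made-up counts).
toy_confusion = [
    [50, 10],
    [5, 35],
]
ious, f1s = per_class_iou_f1(toy_confusion)
print("MeanIoU:", mean(ious), "mean F1:", mean(f1s))
```

Per class, F1 equals 2·IoU / (1 + IoU), so it is never lower than IoU, which is consistent with the F1 figures in the abstract being higher than the MeanIoU figures.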

Abstract:
Immersive communication has made significant advancements, especially with the release of the codec for Immersive Voice and Audio Services. Aiming at its further realization, the DCASE 2025 Challenge has recently introduced a task for spatial semantic segmentation of sound scenes (S5), which focuses on detecting and separating sound events in spatial sound scenes. In this paper, we explore methods for addressing the S5 task. Specifically, we present baseline S5 systems that combine audio tagging (AT) and label‑queried source separation (LSS) models. We investigate two LSS approaches based on the ResUNet architecture: a) extracting a single source for each detected event and b) querying multiple sources concurrently. Since each separated source in S5 is identified by its sound event class label, we propose new class‑aware metrics to evaluate both the sound sources and labels simultaneously. Experimental results on first‑order ambisonics spatial audio demonstrate the effectiveness of the proposed systems and confirm the efficacy of the metrics.

Abstract:
As a fundamental task in computer vision, semantic segmentation is widely applied in fields such as autonomous driving, remote sensing image analysis, and medical image processing. In recent years, Transformer‑based segmentation methods have demonstrated strong performance in global feature modeling. However, they still struggle with blurred target boundaries and insufficient recognition of small targets. To address these issues, this study proposes a Mask2Former‑based semantic segmentation algorithm incorporating a boundary enhancement feature bridging module (BEFBM). The goal is to improve target boundary accuracy and segmentation consistency. Built upon the Mask2Former framework, this method constructs a boundary‑aware feature map and introduces a feature bridging mechanism. This enables effective cross‑scale feature fusion, enhancing the model's ability to focus on target boundaries. Experiments on the Cityscapes dataset demonstrate that, compared to mainstream segmentation methods, the proposed approach achieves significant improvements in metrics such as mIOU, mDICE, and mRecall. It also exhibits superior boundary retention in complex scenes. Visual analysis further confirms the model's advantages in fine‑grained regions. Future research will focus on optimizing computational efficiency and exploring its potential in other high‑precision segmentation tasks.

Abstract:
News videos require efficient content organisation and retrieval systems, but their unstructured nature poses significant challenges for automated processing. This paper presents a comprehensive comparative analysis of image, video, and audio classifiers for automated news video segmentation. We develop and evaluate multiple deep learning approaches, including ResNet, ViViT, AST, and multimodal architectures, to classify five distinct segment types: advertisements, stories, studio scenes, transitions, and visualisations. Using a custom‑annotated dataset of 41 news videos comprising 1,832 scene clips, our experiments demonstrate that image‑based classifiers achieve superior performance (84.34% accuracy) compared to more complex temporal models. Notably, the ResNet architecture outperformed state‑of‑the‑art video classifiers while requiring significantly fewer computational resources. Binary classification models achieved high accuracy for transitions (94.23%) and advertisements (92.74%). These findings advance the understanding of effective architectures for news video segmentation and provide practical insights for implementing automated content organisation systems in media applications such as media archiving, personalised content delivery, and intelligent video search.

Abstract:
Reasoning segmentation (RS) aims to identify and segment objects of interest based on implicit text queries. As such, RS is a catalyst for embodied AI agents, enabling them to interpret high‑level commands without requiring explicit step‑by‑step guidance. However, current RS approaches rely heavily on the visual perception capabilities of multimodal large language models (LLMs), leading to several major limitations. First, they struggle with queries that require multiple steps of reasoning or those that involve complex spatial/temporal relationships. Second, they necessitate LLM fine‑tuning, which may require frequent updates to maintain compatibility with contemporary LLMs and may increase risks of catastrophic forgetting during fine‑tuning. Finally, being primarily designed for static images or offline video processing, they scale poorly to online video data. To address these limitations, we propose an agent framework that disentangles perception and reasoning for online video RS without LLM fine‑tuning. Our innovation is the introduction of a just‑in‑time digital twin concept, where, given an implicit query, an LLM plans the construction of a low‑level scene representation from high‑level video using specialist vision models. We refer to this approach to creating a digital twin as "just‑in‑time" because the LLM planner will anticipate the need for specific information and only request this limited subset instead of always evaluating every specialist model. The LLM then performs reasoning on this digital twin representation to identify target objects. To evaluate our approach, we introduce a new comprehensive video reasoning segmentation benchmark comprising 200 videos with 895 implicit text queries. The benchmark spans three reasoning categories (semantic, spatial, and temporal) with three different levels of reasoning chain complexity.

Abstract:
Weakly Supervised Semantic Segmentation (WSSS) with image‑level labels aims to achieve pixel‑level predictions using Class Activation Maps (CAMs). Recently, Contrastive Language‑Image Pre‑training (CLIP) has been introduced in WSSS. However, recent methods primarily focus on image‑text alignment for CAM generation, while CLIP's potential in patch‑text alignment remains unexplored. In this work, we propose ExCEL to explore CLIP's dense knowledge via a novel patch‑text alignment paradigm for WSSS. Specifically, we propose Text Semantic Enrichment (TSE) and Visual Calibration (VC) modules to improve the dense alignment across both text and vision modalities. To make text embeddings semantically informative, our TSE module applies Large Language Models (LLMs) to build a dataset‑wide knowledge base and enriches the text representations with an implicit attribute‑hunting process. To mine fine‑grained knowledge from visual features, our VC module first proposes Static Visual Calibration (SVC) to propagate fine‑grained knowledge in a non‑parametric manner. Then Learnable Visual Calibration (LVC) is further proposed to dynamically shift the frozen features towards distributions with diverse semantics. With these enhancements, ExCEL not only retains CLIP's training‑free advantages but also significantly outperforms other state‑of‑the‑art methods with much less training cost on PASCAL VOC and MS COCO.

Abstract:
The potential of tree planting as a natural climate solution is often undermined by inadequate monitoring of tree planting projects. Current monitoring methods involve measuring trees by hand for each species, requiring extensive cost, time, and labour. Advances in drone remote sensing and computer vision offer great potential for mapping and characterizing trees from aerial imagery, and large pre‑trained vision models, such as the Segment Anything Model (SAM), may be a particularly compelling choice given limited labeled data. In this work, we compare SAM methods for the task of automatic tree crown instance segmentation in high resolution drone imagery of young tree plantations. We explore the potential of SAM for this task, and find that methods using SAM out‑of‑the‑box do not outperform a custom Mask R‑CNN, even with well‑designed prompts, but that there is potential for methods which tune SAM further. We also show that predictions can be improved by adding Digital Surface Model (DSM) information as an input.

Abstract:
Uncertainty Quantification (UQ) is crucial for ensuring the reliability of machine learning models deployed in real‑world autonomous systems. However, existing approaches typically quantify task‑level output prediction uncertainty without considering epistemic uncertainty at the multimodal feature fusion level, leading to sub‑optimal outcomes. Additionally, popular uncertainty quantification methods, e.g., Bayesian approximations, remain challenging to deploy in practice due to high computational costs in training and inference. In this paper, we propose HyperDUM, a novel deterministic uncertainty method (DUM) that efficiently quantifies feature‑level epistemic uncertainty by leveraging hyperdimensional computing. Our method captures the channel and spatial uncertainties through channel‑wise and patch‑wise projection and bundling techniques, respectively. Multimodal sensor features are then adaptively weighted to mitigate uncertainty propagation and improve feature fusion. Our evaluations show that HyperDUM on average outperforms the state‑of‑the‑art (SOTA) algorithms by up to 2.01%/1.27% in 3D Object Detection and improves over baselines by up to 1.29% in semantic segmentation tasks under various types of uncertainties. Notably, HyperDUM requires 2.36x fewer Floating Point Operations and up to 38.30x fewer parameters than SOTA methods, providing an efficient solution for real‑world autonomous systems.

Abstract:
Coral reefs are declining worldwide due to climate change and local stressors. To inform effective conservation or restoration, monitoring at the highest possible spatial and temporal resolution is necessary. Conventional coral reef surveying methods are limited in scalability due to their reliance on expert labor time, motivating the use of computer vision tools to automate the identification and abundance estimation of live corals from images. However, the design and evaluation of such tools has been impeded by the lack of large high quality datasets. We release the Coralscapes dataset, the first general‑purpose dense semantic segmentation dataset for coral reefs, covering 2075 images, 39 benthic classes, and 174k segmentation masks annotated by experts. Coralscapes has a similar scope and the same structure as the widely used Cityscapes dataset for urban scene segmentation, allowing benchmarking of semantic segmentation models in a new challenging domain which requires expert knowledge to annotate. We benchmark a wide range of semantic segmentation models, and find that transfer learning from Coralscapes to existing smaller datasets consistently leads to state‑of‑the‑art performance. Coralscapes will catalyze research on efficient, scalable, and standardized coral reef surveying methods based on computer vision, and holds the potential to streamline the development of underwater ecological robotics.

Abstract:
Large Vision‑Language Models (VLMs) are increasingly being regarded as foundation models that can be instructed to solve diverse tasks by prompting, without task‑specific training. We examine the seemingly obvious question: how to effectively prompt VLMs for semantic segmentation. To that end, we systematically evaluate the segmentation performance of several recent models guided by either text or visual prompts on the out‑of‑distribution MESS dataset collection. We introduce a scalable prompting scheme, few‑shot prompted semantic segmentation, inspired by open‑vocabulary segmentation and few‑shot learning. It turns out that VLMs lag far behind specialist models trained for a specific segmentation task, by about 30% on average on the Intersection‑over‑Union metric. Moreover, we find that text prompts and visual prompts are complementary: each one of the two modes fails on many examples that the other one can solve. Our analysis suggests that being able to anticipate the most effective prompt modality can lead to an 11% improvement in performance. Motivated by our findings, we propose PromptMatcher, a remarkably simple training‑free baseline that combines both text and visual prompts, achieving state‑of‑the‑art results outperforming the best text‑prompted VLM by 2.5%, and the top visual‑prompted VLM by 3.5% on few‑shot prompted semantic segmentation.
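
The claim that anticipating the best prompt modality helps can be read as an oracle upper bound: for each example, take whichever of the text‑prompted or visual‑prompted predictions scores higher. The Python sketch below computes that headroom from per‑example IoU scores. The function name and the toy scores are hypothetical, and this is only the arithmetic of such an oracle analysis, not PromptMatcher itself, whose mechanism the abstract does not describe.

```python
def oracle_modality_gain(text_iou, visual_iou):
    """Gain of a per-example oracle modality choice over the best single modality."""
    if not text_iou or len(text_iou) != len(visual_iou):
        raise ValueError("need two non-empty score lists of equal length")
    n = len(text_iou)
    mean_text = sum(text_iou) / n
    mean_visual = sum(visual_iou) / n
    # The oracle picks, per example, the modality with the higher IoU.
    mean_oracle = sum(max(t, v) for t, v in zip(text_iou, visual_iou)) / n
    return mean_oracle - max(mean_text, mean_visual)


# Hypothetical per-example IoU scores (not results from the paper).
text_scores = [0.8, 0.2, 0.6, 0.1]
visual_scores = [0.3, 0.7, 0.5, 0.6]
print(oracle_modality_gain(text_scores, visual_scores))
```

The gain is zero when one modality is at least as good on every example, and grows as the two modalities fail on different examples, which is the complementarity the abstract reports.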

Abstract:
In recent years, the rapid development of machine learning has brought reforms and challenges to traditional communication systems. Semantic communication has emerged as an effective strategy for extracting relevant semantic signals, such as semantic segmentation labels and image features, for image transmission. However, an insufficient number of extracted semantic features can result in low reconstruction accuracy, which hinders practical applications and remains challenging to solve. To fill this gap, this letter proposes a multi‑text transmission semantic communication (Multi‑SC) system, which uses a visual language model (VLM) to assist in the transmission of image semantic signals. Unlike previous image transmission semantic communication systems, the proposed system divides the image into multiple blocks, extracts multiple pieces of text information from the image using a modified large language and visual assistant (LLaVA), and combines semantic segmentation tags with semantic text for image recovery. Simulation results show that the proposed text semantics diversity scheme can significantly improve the reconstruction accuracy compared with related works.

Abstract:
RGB‑T road scene semantic segmentation enhances visual scene understanding in complex environments characterized by inadequate illumination or occlusion by fusing information from RGB and thermal images. Nevertheless, existing RGB‑T semantic segmentation models typically depend on simple addition or concatenation strategies or ignore the differences between information at different levels. To address these issues, we propose a novel RGB‑T road scene semantic segmentation network called Brain‑Inspired Multi‑Iteration Interaction Network (BIMII‑Net). First, to meet the requirements of accurate texture and local information extraction in road scenarios like autonomous driving, we propose a deep continuous‑coupled neural network (DCCNN) architecture based on a brain‑inspired model. Second, to enhance the interaction and expression capabilities among multi‑modal information, we design a cross explicit attention‑enhanced fusion module (CEAEF‑Module) in the feature fusion stage of BIMII‑Net to effectively integrate features at different levels. Finally, we construct a complementary interactive multi‑layer decoder structure, incorporating the shallow‑level feature iteration module (SFI‑Module), the deep‑level feature iteration module (DFI‑Module), and the multi‑feature enhancement module (MFE‑Module) to collaboratively extract texture details and global skeleton information, with multi‑module joint supervision further optimizing the segmentation results. Experimental results demonstrate that BIMII‑Net achieves state‑of‑the‑art (SOTA) performance in the brain‑inspired computing domain and outperforms most existing RGB‑T semantic segmentation methods. It also exhibits strong generalization capabilities on multiple RGB‑T datasets, proving the effectiveness of brain‑inspired computer models in multi‑modal image segmentation tasks.

Abstract:
Feature Coding for Machines (FCM) aims to compress intermediate features effectively for remote intelligent analytics, which is crucial for future intelligent visual applications. In this paper, we propose a Multiscale Feature Importance‑based Bit Allocation (MFIBA) for end‑to‑end FCM. First, we find that the importance of features for machine vision tasks varies with the scales, object size, and image instances. Based on this finding, we propose a Multiscale Feature Importance Prediction (MFIP) module to predict the importance weight for each scale of features. Second, we propose a task loss‑rate model to establish the relationship between the task accuracy losses of using compressed features and the bitrate of encoding these features. Finally, we develop an MFIBA for end‑to‑end FCM, which is able to assign coding bits of multiscale features more reasonably based on their importance. Experimental results demonstrate that when combined with a retained Efficient Learned Image Compression (ELIC), the proposed MFIBA achieves an average of 38.202% bitrate savings in object detection compared to the anchor ELIC. Moreover, the proposed MFIBA achieves an average of 17.212% and 36.492% feature bitrate savings for instance segmentation and keypoint detection, respectively. When the proposed MFIBA is applied to the LIC‑TCM, it achieves an average of 18.103%, 19.866% and 19.597% bitrate savings on three machine vision tasks, respectively, which validates that the proposed MFIBA has good generalizability and adaptability to different machine vision tasks and FCM base codecs.
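
To make the idea of importance‑based bit allocation concrete, here is a deliberately simple Python sketch that splits a fixed bit budget across feature scales in proportion to per‑scale importance weights. The proportional rule, the largest‑remainder rounding, and all names (`allocate_bits`, `min_bits`) are assumptions for illustration only. The abstract states that MFIBA relies on predicted importance weights together with a task loss‑rate model, and that model is not reproduced here.

```python
def allocate_bits(importance, total_bits, min_bits=0):
    """Split an integer bit budget across scales in proportion to importance weights."""
    n = len(importance)
    if n == 0 or any(w < 0 for w in importance):
        raise ValueError("importance must be a non-empty list of non-negative weights")
    if total_bits < min_bits * n:
        raise ValueError("budget too small for the requested per-scale minimum")
    spare = total_bits - min_bits * n
    weight_sum = sum(importance)
    if weight_sum == 0:
        shares = [spare / n] * n            # no preference: split evenly
    else:
        shares = [spare * w / weight_sum for w in importance]
    bits = [min_bits + int(s) for s in shares]
    # Hand out the bits lost to rounding, largest fractional share first.
    leftover = total_bits - sum(bits)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        bits[i] += 1
    return bits


# Hypothetical importance weights for three feature scales.
print(allocate_bits([5, 3, 2], total_bits=1000))
```

The allocation always sums to the requested budget, so a more important scale gains bits only at the expense of less important ones.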

Abstract:
Semantic segmentation has made significant strides in pixel‑level image understanding, yet it remains limited in capturing contextual and semantic relationships between objects. Current models, such as CNN and Transformer‑based architectures, excel at identifying pixel‑level features but fail to distinguish semantically similar objects (e.g., "doctor" vs. "nurse" in a hospital scene) or understand complex contextual scenarios (e.g., differentiating a running child from a regular pedestrian in autonomous driving). To address these limitations, we propose a novel Context‑Aware Semantic Segmentation framework that integrates Large Language Models (LLMs) with state‑of‑the‑art vision backbones. Our hybrid model leverages the Swin Transformer for robust visual feature extraction and GPT‑4 for enriching semantic understanding through text embeddings. A Cross‑Attention Mechanism is introduced to align vision and language features, enabling the model to reason about context more effectively. Additionally, Graph Neural Networks (GNNs) are employed to model object relationships within the scene, capturing dependencies that are overlooked by traditional models. Experimental results on benchmark datasets (e.g., COCO, Cityscapes) demonstrate that our approach outperforms existing methods in both pixel‑level accuracy (mIoU) and contextual understanding (mAP). This work bridges the gap between vision and language, paving the path for more intelligent and context‑aware vision systems in applications including autonomous driving, medical imaging, and robotics.

Abstract:
Vision foundation models (VFMs) trained on large‑scale image datasets provide high‑quality features that have significantly advanced 2D visual recognition. However, their potential in 3D scene segmentation remains largely untapped, despite the common availability of 2D images alongside 3D point cloud datasets. While significant research has been dedicated to 2D‑3D fusion, recent state‑of‑the‑art 3D methods predominantly focus on 3D data, leaving the integration of VFMs into 3D models underexplored. In this work, we challenge this trend by introducing DITR, a generally applicable approach that extracts 2D foundation model features, projects them to 3D, and finally injects them into a 3D point cloud segmentation model. DITR achieves state‑of‑the‑art results on both indoor and outdoor 3D semantic segmentation benchmarks. To enable the use of VFMs even when images are unavailable during inference, we additionally propose to pretrain 3D models by distilling 2D foundation models. By initializing the 3D backbone with knowledge distilled from 2D VFMs, we create a strong basis for downstream 3D segmentation tasks, ultimately boosting performance across various datasets.

Abstract:
CNNs were long considered state of the art for image processing, but the introduction of Transformer architectures has challenged this position. Although Transformers achieve excellent results in image classification and segmentation, they remain inherently reliant on large training datasets and are computationally expensive. A newly introduced Transformer derivative named KV Transformer shows promising results in synthetic, NLP, and image classification tasks, while reducing complexity and memory usage. This is especially conducive to use cases where local inference is required, such as medical screening applications. We endeavoured to further evaluate the merit of KV Transformers on semantic segmentation tasks, specifically in the domain of medical imaging. By directly comparing traditional and KV variants of the same base architectures, we provide further insight into the practical tradeoffs of reduced model complexity. We observe a notable reduction in parameter count and multiply‑accumulate operations, while most of the KV variant models achieve performance similar to their QKV implementations.

Abstract:
Recent advances in self‑supervised learning have led to the development of foundation models that have significantly advanced performance in various computer vision tasks. However, despite their potential, these models often overlook the crucial role of high‑resolution digital surface models (DSMs) in understanding urban environments, particularly for building‑level analysis, which is essential for applications like digital twins. To address this gap, we introduce HiRes‑FusedMIM, a novel pre‑trained model specifically designed to leverage the rich information contained within high‑resolution RGB and DSM data. HiRes‑FusedMIM utilizes a dual‑encoder simple masked image modeling (SimMIM) architecture with a multi‑objective loss function that combines reconstruction and contrastive objectives, enabling it to learn powerful, joint representations from both modalities. We conducted a comprehensive evaluation of HiRes‑FusedMIM on a diverse set of downstream tasks, including classification, semantic segmentation, and instance segmentation. Our results demonstrate that: 1) HiRes‑FusedMIM outperforms previous state‑of‑the‑art geospatial methods on several building‑related datasets, including WHU Aerial and LoveDA, demonstrating its effectiveness in capturing and leveraging fine‑grained building information; 2) Incorporating DSMs during pre‑training consistently improves performance compared to using RGB data alone, highlighting the value of elevation information for building‑level analysis; 3) The dual‑encoder architecture of HiRes‑FusedMIM, with separate encoders for RGB and DSM data, significantly outperforms a single‑encoder model on the Vaihingen segmentation task, indicating the benefits of learning specialized representations for each modality. To facilitate further research and applications in this direction, we will publicly release the trained model weights.

Abstract:
The integration of RGB and depth modalities significantly enhances the accuracy of segmenting complex indoor scenes, with depth data from RGB‑D cameras playing a crucial role in this improvement. However, collecting an RGB‑D dataset is more expensive than an RGB dataset due to the need for specialized depth sensors. Aligning depth and RGB images also poses challenges due to sensor positioning and issues like missing data and noise. In contrast, Pseudo Depth (PD) from high‑precision depth estimation algorithms can eliminate the dependence on RGB‑D sensors and alignment processes, as well as provide effective depth information and show significant potential in semantic segmentation. Therefore, to explore the practicality of utilizing pseudo depth instead of real depth for semantic segmentation, we design an RGB‑PD segmentation pipeline to integrate RGB and pseudo depth and propose a Pseudo Depth Aggregation Module (PDAM) for fully exploiting the informative clues provided by the diverse pseudo depth maps. The PDAM aggregates multiple pseudo depth maps into a single modality, making it easily adaptable to other RGB‑D segmentation methods. In addition, the pre‑trained diffusion model serves as a strong feature extractor for RGB segmentation tasks, but multi‑modal diffusion‑based segmentation methods remain unexplored. Therefore, we present a Pseudo Depth Diffusion Model (PDDM) that adopts a large‑scale text‑image diffusion model as a feature extractor and a simple yet effective fusion strategy to integrate pseudo depth. To verify the applicability of pseudo depth and our PDDM, we perform extensive experiments on the NYUv2 and SUNRGB‑D datasets. The experimental results demonstrate that pseudo depth can effectively enhance segmentation performance, and our PDDM achieves state‑of‑the‑art performance, outperforming other methods by +6.98 mIoU on NYUv2 and +2.11 mIoU on SUNRGB‑D.

Abstract:
Vision Foundation Models (VFMs) excel in generalization due to large‑scale pretraining, but fine‑tuning them for Domain Generalized Semantic Segmentation (DGSS) while maintaining this ability remains challenging. Existing approaches either selectively fine‑tune parameters or freeze the VFMs and update only the adapters, both of which may underutilize the VFMs' full potential in DGSS tasks. We observe that domain‑sensitive parameters in VFMs, arising from task and distribution differences, can hinder generalization. To address this, we propose FisherTune, a robust fine‑tuning method guided by the Domain‑Related Fisher Information Matrix (DR‑FIM). DR‑FIM measures parameter sensitivity across tasks and domains, enabling selective updates that preserve generalization and enhance DGSS adaptability. FisherTune incorporates variational inference to stabilize DR‑FIM estimation, treating parameters as Gaussian‑distributed variables and leveraging pre‑trained priors. Extensive experiments show that FisherTune achieves superior cross‑domain segmentation while maintaining generalization, outperforming selective‑parameter and adapter‑based methods.

Abstract:
Contemporary Video Instance Segmentation (VIS) methods typically adhere to a pre‑train then fine‑tune regime, where a segmentation model trained on images is fine‑tuned on videos. However, the lack of temporal knowledge in the pre‑trained model introduces a domain gap which may adversely affect the VIS performance. To effectively bridge this gap, we present a novel video pre‑training approach to enhance VIS models, especially for videos with intricate instance relationships. Our key innovation focuses on reducing disparities between the pre‑training and fine‑tuning stages. Specifically, we first introduce consistent pseudo‑video augmentations to create diverse pseudo‑video samples for pre‑training while maintaining the instance consistency across frames. Then, we incorporate a multi‑scale temporal module to enhance the model's ability to model temporal relations through self‑ and cross‑attention at short‑ and long‑term temporal spans. Our approach does not set constraints on model architecture and can integrate seamlessly with various VIS methods. Experimental results on commonly adopted VIS benchmarks show that our method consistently outperforms state‑of‑the‑art methods. Our approach achieves a notable 4.0% increase in average precision on the challenging OVIS dataset.

Abstract:
LiDAR‑based 3D object detection and semantic segmentation are critical tasks in 3D scene understanding. Traditional detection and segmentation methods supervise their models through bounding box labels and semantic mask labels. However, these two independent labels inherently contain significant redundancy. This paper aims to eliminate the redundancy by supervising 3D object detection using only semantic labels. However, the challenge arises due to the incomplete geometry structure and boundary ambiguity of point‑cloud instances, leading to inaccurate pseudo labels and poor detection results. To address these challenges, we propose a novel method, named Seg2Box. We first introduce a Multi‑Frame Multi‑Scale Clustering (MFMS‑C) module, which leverages the spatio‑temporal consistency of point clouds to generate accurate box‑level pseudo‑labels. Additionally, the Semantic‑Guiding Iterative‑Mining Self‑Training (SGIM‑ST) module is proposed to enhance the performance by progressively refining the pseudo‑labels and mining the instances without generating pseudo‑labels. Experiments on the Waymo Open Dataset and nuScenes Dataset show that our method significantly outperforms other competitive methods by 23.7% and 10.3% in mAP, respectively. The results demonstrate the great label‑efficient potential and advancement of our method.

Abstract:
3D Gaussian Splatting (3DGS) has significantly improved the efficiency and realism of three‑dimensional scene visualization in several applications, ranging from robotics to eXtended Reality (XR). This work presents SAGE (Semantic‑Driven Adaptive Gaussian Splatting in Extended Reality), a novel framework designed to enhance the user experience by dynamically adapting the Level of Detail (LOD) of different 3DGS objects identified via semantic segmentation. Experimental results demonstrate how SAGE effectively reduces memory and computational overhead while maintaining a desired target visual quality, thus providing a powerful optimization for interactive XR applications.

Abstract:
Vision‑language models (VLMs) excel in visual understanding but often lack reliable grounding capabilities and actionable inference rates. Integrating them with open‑vocabulary object detection (OVD), instance segmentation, and tracking leverages their strengths while mitigating these drawbacks. We utilize VLM‑generated structured descriptions to identify visible object instances, collect application‑relevant attributes, and inform an open‑vocabulary detector to extract corresponding bounding boxes that are passed to a video segmentation model providing segmentation masks and tracking. Once initialized, this model directly extracts segmentation masks, processing image streams in real time with minimal computational overhead. Tracks can be updated online as needed by generating new structured descriptions and detections. This combines the descriptive power of VLMs with the grounding capability of OVD and the pixel‑level understanding and speed of video segmentation. Our evaluation across datasets and robotics platforms demonstrates the broad applicability of this approach, showcasing its ability to extract task‑specific attributes from non‑standard objects in dynamic environments. Code, data, videos, and benchmarks are available at https://vlm‑gist.github.io

Abstract:
Existing autonomous driving datasets are predominantly oriented towards well‑structured urban settings and favourable weather conditions, leaving the complexities of rural environments and adverse weather conditions largely unaddressed. Although some datasets encompass variations in weather and lighting, bad weather scenarios do not appear often. Rainfall can significantly impair sensor functionality, introducing noise and reflections in LiDAR and camera data and reducing the system's capabilities for reliable environmental perception and safe navigation. This paper introduces the Panoptic‑CUDAL dataset, a novel dataset purpose‑built for panoptic segmentation in rural areas subject to rain. By recording high‑resolution LiDAR, camera, and pose data, Panoptic‑CUDAL offers a diverse, information‑rich dataset in a challenging scenario. We present the analysis of the recorded data and provide baseline results for panoptic, semantic segmentation, and 3D occupancy prediction methods on LiDAR point clouds. The dataset can be found here: https://robotics.sydney.edu.au/our‑research/intelligent‑transportation‑systems, https://vision.rwth‑aachen.de/panoptic‑cudal

Abstract:
We present a novel approach for controllable, region‑specific style editing driven by textual prompts. Building upon the state‑space style alignment framework introduced by StyleMamba, our method integrates a semantic segmentation model into the style transfer pipeline. This allows users to selectively apply text‑driven style changes to specific segments (e.g., "turn the building into a cyberpunk tower") while leaving other regions (e.g., "people" or "trees") unchanged. By incorporating region‑wise condition vectors and a region‑specific directional loss, our method achieves high‑fidelity transformations that respect both semantic boundaries and user‑driven style descriptions. Extensive experiments demonstrate that our approach can flexibly handle complex scene stylizations in real‑world scenarios, improving control and quality over purely global style transfer methods.

Abstract:
Existing domain generalization methods for LiDAR semantic segmentation under adverse weather struggle to accurately predict "things" categories compared to "stuff" categories. In typical driving scenes, "things" categories can be dynamic and associated with higher collision risks, making them crucial for safe navigation and planning. Recognizing the importance of "things" categories, we identify their performance drop as a serious bottleneck in existing approaches. We observe that adverse weather induces both degradation of semantic‑level features and corruption of local features, leading to misprediction of "things" as "stuff". To mitigate these corruptions, we propose our method, NTN ‑ segmeNt Things for No‑accident. To address semantic‑level feature corruption, we bind each point feature to its superclass, preventing the misprediction of "things" classes into visually dissimilar categories. Additionally, to enhance robustness against local corruption caused by adverse weather, we define each LiDAR beam as a local region and propose a regularization term that aligns the clean data with its corrupted counterpart in feature space. NTN achieves state‑of‑the‑art performance with a +2.6 mIoU gain on the SemanticKITTI‑to‑SemanticSTF benchmark and +7.9 mIoU on the SemanticPOSS‑to‑SemanticSTF benchmark. Notably, NTN achieves a +4.8 and +7.9 mIoU improvement on "things" classes, respectively, highlighting its effectiveness.

Abstract:
This study explores the integration of machine learning into urban aerial image analysis, with a focus on identifying infrastructure surfaces for cars and pedestrians and analyzing historical trends. It emphasizes the transition from convolutional architectures to transformer‑based pre‑trained models, underscoring their potential in global geospatial analysis. A workflow is presented for automatically generating geospatial datasets, enabling the creation of semantic segmentation datasets from various sources, including WMS/WMTS links, vectorial cartography, and OpenStreetMap (OSM) overpass‑turbo requests. The developed code allows fast dataset generation for training machine learning models using openly available data without manual labelling. Using aerial imagery and vectorial data from the respective geographical offices of Madrid and Vienna, two datasets were generated for car and pedestrian surface detection. A transformer‑based model was trained and evaluated for each city, demonstrating good accuracy values. The historical trend analysis involved applying the trained model to earlier images predating the availability of vectorial data by 10 to 20 years, successfully identifying temporal trends in infrastructure for pedestrians and cars across different city areas. This technique allows municipal governments to gather valuable data at minimal cost.

Abstract:
Segmenting transparent structures in images is challenging since they are difficult to distinguish from the background. Common examples are drinking glasses, which are a ubiquitous part of our lives and appear in many different shapes and sizes. In this work, we propose TransCaGNet, a modified version of the zero‑shot model CaGNet. We exchange the segmentation backbone with the architecture of Trans4Trans to be capable of segmenting transparent objects. Since some glasses are rarely captured, we use zero‑shot learning to create semantic segmentations of glass categories not given during training. We propose a novel synthetic dataset covering a diverse set of environmental conditions. Additionally, we capture a real‑world evaluation dataset since most applications take place in the real world. Comparing our model with ZegCLIP, we are able to show that TransCaGNet produces better mean IoU and accuracy values, while ZegCLIP outperforms it mostly for unseen classes. To improve the segmentation results, we combine the semantic segmentation of the models with the segmentation results of SAM 2. Our evaluation emphasizes that distinguishing between different classes is challenging for the models due to similarity, points of view, or coverings. Taking this behavior into account, we assign glasses multiple possible categories. The modification leads to an improvement of up to 13.68% for the mean IoU and up to 17.88% for the mean accuracy values on the synthetic dataset. Using our difficult synthetic dataset for training, the models produce even better results on the real‑world dataset. The mean IoU is improved by up to 5.55% and the mean accuracy by up to 5.72% on the real‑world dataset.

Abstract:
The increasing demand for high‑accuracy depth estimation in autonomous driving and augmented reality applications necessitates advanced neural architectures capable of effectively leveraging multiple data modalities. In this context, we introduce the Unified Segmentation Attention Mechanism Network (USAM‑Net), a novel convolutional neural network that integrates stereo image inputs with semantic segmentation maps and attention to enhance depth estimation performance. USAM‑Net employs a dual‑pathway architecture, which combines a pre‑trained segmentation model (SAM) and a depth estimation model. The segmentation pathway preprocesses the stereo images to generate semantic masks, which are then concatenated with the stereo images as inputs to the depth estimation pathway. This integration allows the model to focus on important features such as object boundaries and surface textures which are crucial for accurate depth perception. Empirical evaluation on the DrivingStereo dataset demonstrates that USAM‑Net achieves superior performance metrics, including a Global Difference (GD) of 3.61% and an End‑Point Error (EPE) of 0.88, outperforming traditional models such as CFNet, SegStereo, and iResNet. These results underscore the effectiveness of integrating segmentation information into stereo depth estimation tasks, highlighting the potential of USAM‑Net in applications demanding high‑precision depth data.
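
For readers unfamiliar with the End‑Point Error (EPE) quoted above, the sketch below computes it in its common stereo form: the mean absolute difference between predicted and ground‑truth disparity over valid pixels. It is a generic illustration, not the evaluation code of USAM‑Net; the convention that a ground‑truth value of 0 marks an invalid pixel and the function name `end_point_error` are assumptions.

```python
def end_point_error(pred, target, invalid=0.0):
    """Mean absolute disparity error over pixels with valid ground truth."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if t == invalid:  # assumed convention: 0 means "no ground truth"
                continue
            total += abs(p - t)
            count += 1
    if count == 0:
        raise ValueError("no valid ground-truth pixels")
    return total / count


# Hypothetical 2x2 disparity maps, in pixels.
pred = [[10.0, 20.5], [30.0, 5.0]]
target = [[11.0, 20.0], [0.0, 7.0]]
epe = end_point_error(pred, target)
```

Here the three valid pixels contribute errors of 1.0, 0.5, and 2.0, so the EPE is their mean; the pixel without ground truth is skipped.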

Abstract:
Accurate perception of dynamic traffic scenes is crucial for high‑level autonomous driving systems, requiring robust object motion estimation and instance segmentation. However, traditional methods often treat them as separate tasks, leading to suboptimal performance, spatio‑temporal inconsistencies, and inefficiency in complex scenarios due to the absence of information sharing. This paper proposes a multi‑task SemanticFlow framework to simultaneously predict scene flow and instance segmentation of full‑resolution point clouds. The novelty of this work is threefold: 1) developing a coarse‑to‑fine prediction based multi‑task scheme, where an initial coarse segmentation of static backgrounds and dynamic objects is used to provide contextual information for refining motion and semantic information through a shared feature processing module; 2) developing a set of loss functions that enhance the performance of scene flow estimation and instance segmentation and also help ensure spatial and temporal consistency of both static and dynamic objects within traffic scenes; 3) developing a self‑supervised learning scheme, which utilizes coarse segmentation to detect rigid objects and compute their transformation matrices between sequential frames, enabling the generation of self‑supervised labels. The proposed framework is validated on the Argoverse and Waymo datasets, demonstrating superior performance in instance segmentation accuracy, scene flow estimation, and computational efficiency, establishing a new benchmark for self‑supervised methods in dynamic scene understanding.

Abstract:
This research paper presents an innovative ship detection system tailored for applications like maritime surveillance and ecological monitoring. The study employs YOLOv8 and a repurposed U‑Net, two advanced deep learning models, to significantly enhance ship detection accuracy. Evaluation metrics include Mean Average Precision (mAP), processing speed, and overall accuracy. The research utilizes the "Airbus Ship Detection" dataset, featuring diverse remote sensing images, to assess the models' versatility in detecting ships with varying orientations and environmental contexts. Conventional ship detection faces challenges with arbitrary orientations, complex backgrounds, and obscured perspectives. Our approach incorporates YOLOv8 for real‑time processing and U‑Net for ship instance segmentation. Evaluation focuses on mAP, processing speed, and overall accuracy. The dataset is chosen for its diverse images, making it an ideal benchmark. Results demonstrate significant progress in ship detection. YOLOv8 achieves an 88% mAP, excelling in accurate and rapid ship detection. U‑Net, adapted for ship instance segmentation, attains an 89% mAP, improving boundary delineation and handling occlusions. This research enhances maritime surveillance, disaster response, and ecological monitoring, exemplifying the potential of deep learning models in ship detection.

Abstract:
Scribble‑based weakly supervised semantic segmentation leverages only a few annotated pixels as labels to train a segmentation model, presenting significant potential for reducing the human labor involved in the annotation process. This approach faces two primary challenges: first, the sparsity of scribble annotations can lead to inconsistent predictions due to limited supervision; second, the variability in scribble annotations, reflecting differing human annotator preferences, can prevent the model from consistently capturing the discriminative regions of objects, potentially leading to unstable predictions. To address these issues, we propose a holistic framework, the class‑driven scribble promotion network, for robust scribble‑supervised semantic segmentation. This framework not only utilizes the provided scribble annotations but also leverages their associated class labels to generate reliable pseudo‑labels. Within the network, we introduce a localization rectification module to mitigate noisy labels and a distance perception module to identify reliable regions surrounding scribble annotations and pseudo‑labels. In addition, we introduce new large‑scale benchmarks, ScribbleCOCO and ScribbleCityscapes, accompanied by a scribble simulation algorithm that enables evaluation across varying scribble styles. Our method demonstrates competitive performance in both accuracy and robustness, underscoring its superiority over existing approaches. The datasets and the codes will be made publicly available.

Abstract:
Single Domain Generalization (SDG) aims to train models that maintain consistent performance across diverse scenarios using data from a single source. While latent diffusion models (LDMs) show promise for augmenting limited source data, our analysis reveals that directly employing synthetic data may not only fail to provide benefits but can actually compromise performance due to substantial feature distribution discrepancies between synthetic and real target domains. To address this issue, we propose Discriminative Domain Reassembly and Soft‑Fusion (DRSF), a training framework leveraging synthetic data to improve model generalization. We employ LDMs to produce diverse pseudo‑target domain samples and introduce two key modules to handle distribution bias. First, the Discriminative Feature Decoupling and Reassembly (DFDR) module uses entropy‑guided attention to recalibrate channel‑level features, suppressing synthetic noise while preserving semantic consistency. Second, the Multi‑pseudo‑domain Soft Fusion (MDSF) module uses adversarial training with latent‑space feature interpolation, creating continuous feature transitions between domains. Extensive SDG experiments on image classification, object detection, and semantic segmentation demonstrate that DRSF delivers substantial performance gains with only marginal computational overhead. Notably, DRSF's plug‑and‑play architecture enables seamless integration with unsupervised domain adaptation paradigms, underscoring its broad applicability to diverse, real‑world domain challenges.

Abstract:
This study proposes a deep learning framework and annotation methodology for the automatic detection of periodontal bone loss landmarks, associated conditions, and staging. 192 periapical radiographs were collected and annotated with a stage agnostic methodology, labelling clinically relevant landmarks regardless of disease presence or extent. We propose a heuristic post‑processing module that aligns predicted keypoints to tooth boundaries using an auxiliary instance segmentation model. An evaluation metric, Percentage of Relative Correct Keypoints (PRCK), is proposed to capture keypoint performance in dental imaging domains. Four donor pose estimation models were adapted with fine‑tuning for our keypoint problem. Post‑processing improved fine‑grained localisation, raising average PRCK^0.05 by +0.028, but reduced coarse performance for PRCK^0.25 by ‑0.0523 and PRCK^0.5 by ‑0.0345. Orientation estimation shows excellent performance for auxiliary segmentation when filtered with either stage 1 object detection model. Periodontal staging was detected sufficiently, with the best mesial and distal Dice scores of 0.508 and 0.489, while furcation involvement and widened periodontal ligament space tasks remained challenging due to scarce positive samples. Scalability is implied with similar validation and external set performance. The annotation methodology enables stage agnostic training with balanced representation across disease severities for some detection tasks. The PRCK metric provides a domain‑specific alternative to generic pose metrics, while the heuristic post‑processing module consistently corrected implausible predictions with occasional catastrophic failures. The proposed framework demonstrates the feasibility of clinically interpretable periodontal bone loss assessment, with potential to reduce diagnostic variability and clinician workload.

Abstract:
This study proposes a 3D semantic segmentation method for the spine based on the improved SwinUNETR to improve segmentation accuracy and robustness. Aiming at the complex anatomical structure of spinal images, this paper introduces a multi‑scale fusion mechanism to enhance the feature extraction capability by using information of different scales, thereby improving the recognition accuracy of the model for the target area. In addition, the introduction of the adaptive attention mechanism enables the model to dynamically adjust the attention to the key area, thereby optimizing the boundary segmentation effect. The experimental results show that compared with 3D CNN, 3D U‑Net, and 3D U‑Net + Transformer, the model of this study has achieved significant improvements in mIoU, mDice, and mAcc indicators, and has better segmentation performance. The ablation experiment further verifies the effectiveness of the proposed improved method, proving that multi‑scale fusion and adaptive attention mechanism have a positive effect on the segmentation task. Through the visualization analysis of the inference results, the model can better restore the real anatomical structure of the spinal image. Future research can further optimize the Transformer structure and expand the data scale to improve the generalization ability of the model. This study provides an efficient solution for the task of medical image segmentation, which is of great significance to intelligent medical image analysis.

Abstract:
Sound‑guided object segmentation has drawn considerable attention for its potential to enhance multimodal perception. Previous methods primarily focus on developing advanced architectures to facilitate effective audio‑visual interactions, without fully addressing the inherent challenges posed by the nature of audio, i.e., (1) feature confusion due to the overlapping nature of audio signals, and (2) audio‑visual matching difficulty from the varied sounds produced by the same object. To address these challenges, we propose Dynamic Derivation and Elimination (DDESeg): a novel audio‑visual segmentation framework. Specifically, to mitigate feature confusion, DDESeg reconstructs the semantic content of the mixed audio signal by enriching the distinct semantic information of each individual source, deriving representations that preserve the unique characteristics of each sound. To reduce the matching difficulty, we introduce a discriminative feature learning module, which enhances the semantic distinctiveness of generated audio representations. Considering that not all derived audio representations directly correspond to visual features (e.g., off‑screen sounds), we propose a dynamic elimination module to filter out non‑matching elements. This module facilitates targeted interaction between sounding regions and relevant audio semantics. By scoring the interacted features, we identify and filter out irrelevant audio information, ensuring accurate audio‑visual alignment. Comprehensive experiments demonstrate that our framework achieves superior performance on AVS datasets.

Abstract:
Underwater video analysis, hampered by the dynamic marine environment and camera motion, remains a challenging task in computer vision. Existing training‑free video generation techniques, which learn motion dynamics on a frame‑by‑frame basis, often produce poor results with noticeable motion interruptions and misalignments. To address these issues, we propose AUTV, a framework for synthesizing marine video data with pixel‑wise annotations. We demonstrate the effectiveness of this framework by constructing two video datasets, namely UTV, a real‑world dataset comprising 2,000 video‑text pairs, and SUTV, a synthetic video dataset including 10,000 videos with segmentation masks for marine objects. UTV provides diverse underwater videos with comprehensive annotations including appearance, texture, camera intrinsics, lighting, and animal behavior. SUTV can be used to improve underwater downstream tasks, as demonstrated on video inpainting and video object segmentation.

Abstract:
Despite significant advances in deep learning for image and video segmentation, existing models continue to face challenges in cross‑domain adaptability and generalization. Image and video segmentation are fundamental tasks in computer vision with wide‑ranging applications in healthcare, agriculture, industrial inspection, and autonomous driving. With the advent of large‑scale foundation models, SAM2, an improved version of SAM (Segment Anything Model), has been optimized for segmentation tasks, demonstrating enhanced performance in complex scenarios. However, SAM2's adaptability and limitations in specific domains require further investigation. This paper systematically analyzes the application of SAM2 in image and video segmentation and evaluates its performance in various fields. We begin by introducing the foundational concepts of image segmentation, categorizing foundation models, and exploring the technical characteristics of SAM and SAM2. Subsequently, we delve into SAM2's applications in static image and video segmentation, emphasizing its performance in specialized areas such as medical imaging and the challenges of cross‑domain adaptability. As part of our research, we reviewed over 200 related papers to provide a comprehensive analysis of the topic. Finally, the paper highlights the strengths and weaknesses of SAM2 in segmentation tasks, identifies the technical challenges it faces, and proposes future development directions. This review provides valuable insights and practical recommendations for optimizing and applying SAM2 in real‑world scenarios.

Abstract:
Unsupervised domain adaptation for semantic segmentation (DASS) aims to transfer knowledge from a label‑rich source domain to a target domain with no labels. Two key approaches in DASS are (1) vision‑only approaches using masking or multi‑resolution crops, and (2) language‑based approaches that use generic class‑wise prompts informed by the target domain (e.g. "a snowy photo of a class"). However, the former is susceptible to noisy pseudo‑labels that are biased to the source domain. The latter does not fully capture the intricate spatial relationships of objects, which are key for dense prediction tasks. To this end, we propose LangDA. LangDA addresses these challenges by, first, learning contextual relationships between objects via VLM‑generated scene descriptions (e.g. "a pedestrian is on the sidewalk, and the street is lined with buildings."). Second, LangDA aligns the entire image features with the text representation of this context‑aware scene caption and learns generalized representations via text. With this, LangDA sets the new state‑of‑the‑art across three DASS benchmarks, outperforming existing methods by 2.6%, 1.4% and 3.9%.

Abstract:
Autonomous driving is a safety‑critical application, and it is therefore a top priority that the accompanying assistance systems are able to provide precise information about the surrounding environment of the vehicle. Tasks such as 3D Object Detection deliver an insufficiently detailed understanding of the surrounding scene because they only predict a bounding box for foreground objects. In contrast, 3D Semantic Segmentation provides richer and denser information about the environment by assigning a label to each individual point, which is of paramount importance for autonomous driving tasks, such as navigation or lane changes. To inspire future research, in this review paper, we provide a comprehensive overview of the current state‑of‑the‑art methods in the field of Point Cloud Semantic Segmentation for autonomous driving. We categorize the approaches into projection‑based, 3D‑based and hybrid methods. Moreover, we discuss the most important and commonly used datasets for this task and also emphasize the importance of synthetic data to support research when real‑world data is limited. We further present the results of the different methods and compare them with respect to their segmentation accuracy and efficiency.

Abstract:
Previous works studied how deep neural networks (DNNs) perceive image content in terms of their biases towards different image cues, such as texture and shape. Previous methods to measure shape and texture biases are typically style‑transfer‑based and limited to DNNs for image classification. In this work, we provide a new evaluation procedure consisting of 1) a cue‑decomposition method that comprises two AI‑free data pre‑processing methods extracting shape and texture cues, respectively, and 2) a novel cue‑decomposition shape bias evaluation metric that leverages the cue‑decomposition data. For application purposes, we introduce a corresponding cue‑decomposition robustness metric that allows for the estimation of the robustness of a DNN w.r.t. image corruptions. In our numerical experiments, our findings for biases in image classification DNNs align with those of previous evaluation metrics. However, our cue‑decomposition robustness metric shows superior results in terms of estimating the robustness of DNNs. Furthermore, our results for DNNs on the semantic segmentation datasets Cityscapes and ADE20k shed light, for the first time, on the biases of semantic segmentation DNNs.

Abstract:
Self‑supervised video correspondence learning depends on the ability to accurately associate pixels between video frames that correspond to the same visual object. However, achieving reliable pixel matching without supervision remains a major challenge. To address this issue, recent research has focused on feature learning techniques that aim to encode unique pixel representations for matching. Despite these advances, existing methods still struggle to achieve exact pixel correspondences and often suffer from false matches, limiting their effectiveness in self‑supervised settings. To this end, we explore an efficient self‑supervised Video Correspondence Learning framework (MER) that aims to accurately extract object details from unlabeled videos. First, we design a dedicated Motion Enhancement Engine that emphasizes capturing the dynamic motion of objects in videos. In addition, we introduce a flexible sampling strategy for inter‑pixel correspondence information (Multi‑Cluster Sampler) that enables the model to pay more attention to the pixel changes of important objects in motion. Through experiments, our algorithm outperforms the state‑of‑the‑art competitors on video correspondence learning tasks such as video object segmentation and video object keypoint tracking.

Abstract:
Object state changes in video reveal critical cues about human and agent activity. However, existing methods are limited to temporal localization of when the object is in its initial state (e.g., cheese block) versus when it has completed a state change (e.g., grated cheese), offering no insight into where the change is unfolding. We propose to deepen the problem by introducing the spatially‑progressing object state change segmentation task. The goal is to segment at the pixel‑level those regions of an object that are actionable and those that are transformed. We show that state‑of‑the‑art VLMs and video segmentation methods struggle at this task, underscoring its difficulty and novelty. As an initial baseline, we design a VLM‑based pseudo‑labeling approach, state‑change dynamics constraints, and a novel WhereToChange benchmark built on in‑the‑wild Internet videos. Experiments on two datasets validate both the challenge of the new task as well as the promise of our model for localizing exactly where and how fast objects are changing in video. We further demonstrate useful implications for tracking activity progress to benefit robotic agents. Overall, our work positions spatial OSC segmentation as a new frontier task for video understanding: one that challenges current SOTA methods and invites the community to build more robust, state‑change‑sensitive representations. Project page: https://vision.cs.utexas.edu/projects/spoc‑spatially‑progressing‑osc

Abstract:
Semi‑Supervised Semantic Segmentation reduces reliance on extensive annotations by using unlabeled data and state‑of‑the‑art models to improve overall performance. Despite the success of deep co‑training methods, their underlying mechanisms remain underexplored. This work revisits Cross Pseudo Supervision with dual heterogeneous backbones and introduces Knowledge Consultation (SegKC) to further enhance segmentation performance. The proposed SegKC achieves significant improvements on Pascal and Cityscapes benchmarks, with mIoU scores of 87.1%, 89.2%, and 89.8% on Pascal VOC with the 1/4, 1/2, and full split partition, respectively, while maintaining a compact model architecture.
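
Since mIoU is the headline metric in this and many of the surrounding abstracts, the sketch below shows the standard per‑class intersection‑over‑union computation on flattened label lists. It is a generic illustration, not code from SegKC; the ignore label of 255 and the function name `mean_iou` are assumptions, and classes absent from both prediction and ground truth are skipped.

```python
def mean_iou(pred, target, num_classes, ignore_label=255):
    """Mean of per-class IoU = TP / (TP + FP + FN) over classes that occur."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_label:  # unlabeled pixels do not count
            continue
        if p == t:
            intersection[t] += 1
            union[t] += 1
        else:
            union[p] += 1  # false positive for the predicted class
            union[t] += 1  # false negative for the true class
    ious = [intersection[c] / union[c] for c in range(num_classes) if union[c] > 0]
    if not ious:
        raise ValueError("no labeled pixels")
    return sum(ious) / len(ious)


# Hypothetical flattened prediction and ground truth with three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 255]
miou = mean_iou(pred, target, num_classes=3)
```

In this toy case the per‑class IoUs are 1/2, 2/3, and 1. Benchmark scores are usually computed by accumulating the intersection and union counts over the whole evaluation set before dividing, rather than averaging per‑image values.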

Abstract:
Unsupervised Domain Adaptation (UDA) enables strong generalization from a labeled source domain to an unlabeled target domain, often with limited data. In parallel, Vision Foundation Models (VFMs) pretrained at scale without labels have also shown impressive downstream performance and generalization. This motivates us to explore how UDA can best leverage VFMs. Prior work (VFM‑UDA) demonstrated that replacing a standard ImageNet‑pretrained encoder with a VFM improves generalization. However, it also showed that commonly used feature distance losses harm performance when applied to VFMs. Additionally, VFM‑UDA does not incorporate multi‑scale inductive biases, which are known to improve semantic segmentation. Building on these insights, we propose VFM‑UDA++, which (1) investigates the role of multi‑scale features, (2) adapts feature distance loss to be compatible with ViT‑based VFMs, and (3) evaluates how UDA benefits from increased synthetic source and real target data. By addressing these questions, we can improve performance on the standard GTA5 $\rightarrow$ Cityscapes benchmark by +1.4 mIoU. While prior non‑VFM UDA methods did not scale with more data, VFM‑UDA++ shows consistent improvement and achieves a further +2.4 mIoU gain when scaling the data, demonstrating that VFM‑based UDA continues to benefit from increased data availability.

Abstract:
Vision foundation models (VFMs) such as DINO have led to a paradigm shift in 2D camera‑based perception towards extracting generalized features to support many downstream tasks. Recent works introduce self‑supervised cross‑modal knowledge distillation (KD) as a way to transfer these powerful generalization capabilities into 3D LiDAR‑based models. However, they rely on highly complex distillation losses or pseudo‑semantic maps, or limit KD to features useful for semantic segmentation only. In this work, we propose CleverDistiller, a self‑supervised, cross‑modal 2D‑to‑3D KD framework introducing a set of simple yet effective design choices. Unlike contrastive approaches relying on complex loss design choices, our method employs a direct feature similarity loss in combination with a multilayer perceptron (MLP) projection head to allow the 3D network to learn complex semantic dependencies throughout the projection. Crucially, our approach does not depend on pseudo‑semantic maps, allowing for direct knowledge transfer from a VFM without explicit semantic supervision. Additionally, we introduce the auxiliary self‑supervised spatial task of occupancy prediction to enhance the semantic knowledge, obtained from a VFM through KD, with 3D spatial reasoning capabilities. Experiments on standard autonomous driving benchmarks for 2D‑to‑3D KD demonstrate that CleverDistiller achieves state‑of‑the‑art performance in both semantic segmentation and 3D object detection (3DOD), with gains of up to 10% mIoU, especially when fine‑tuning on very small amounts of data, showing the effectiveness of our simple yet powerful KD strategy.

Abstract:
Recent advances in conditional image generation from diffusion models have shown great potential in achieving impressive image quality while preserving the constraints introduced by the user. In particular, ControlNet enables precise alignment between ground truth segmentation masks and the generated image content, allowing the enhancement of training datasets in segmentation tasks. This raises a key question: Can ControlNet additionally be guided to generate the most informative synthetic samples for a specific task? Inspired by active learning, where the most informative real‑world samples are selected based on sample difficulty or model uncertainty, we propose the first approach to integrate active learning‑based selection metrics into the backward diffusion process for sample generation. Specifically, we explore uncertainty, query by committee, and expected model change, which are commonly used in active learning, and demonstrate their application for guiding the sample generation process through gradient approximation. Our method is training‑free, modifying only the backward diffusion process, allowing it to be used on any pretrained ControlNet. Using this process, we show that segmentation models trained with guided synthetic data outperform those trained on non‑guided synthetic data. Our work underscores the need for advanced control mechanisms for diffusion‑based models that align not only with image content but also with downstream task performance, highlighting the true potential of synthetic data generation.

Abstract:
Automatic Video Object Segmentation (AVOS) refers to the task of autonomously segmenting target objects in video sequences without relying on human‑provided annotations in the first frames. In AVOS, the use of motion information is crucial, with optical flow being a commonly employed method for capturing motion cues. However, the computation of optical flow is resource‑intensive, making it unsuitable for real‑time applications, especially on edge devices with limited computational resources. In this study, we propose using frame differences as an alternative to optical flow for motion cue extraction. We developed an extended U‑Net‑like AVOS model that takes a frame on which segmentation is performed and a frame difference as inputs, and outputs an estimated segmentation map. Our experimental results demonstrate that the proposed model achieves performance comparable to the model with optical flow as an input, particularly when applied to videos captured by stationary cameras. Our results suggest the usefulness of employing frame differences as motion cues in cases with limited computational resources.
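A minimal sketch of how a frame difference can stand in for optical flow as a motion cue is given below. The abstract does not say whether the difference is signed or absolute, nor how the two inputs are combined, so the absolute difference and the two-channel stacking used here are assumptions, and the function names are hypothetical.

```python
def frame_difference(prev_frame, curr_frame):
    """Absolute per-pixel difference between two grayscale frames.

    Frames are lists of rows of intensities; both must have the same shape.
    """
    if len(prev_frame) != len(curr_frame) or any(
        len(p_row) != len(c_row) for p_row, c_row in zip(prev_frame, curr_frame)
    ):
        raise ValueError("frames must have identical shapes")
    return [
        [abs(c - p) for p, c in zip(p_row, c_row)]
        for p_row, c_row in zip(prev_frame, curr_frame)
    ]


def build_model_input(prev_frame, curr_frame):
    """Stack the current frame and its difference map as two input channels."""
    return [curr_frame, frame_difference(prev_frame, curr_frame)]
```

With a stationary camera the background cancels out and only moving pixels remain non-zero, which is consistent with the abstract's observation that the approach works best for stationary cameras.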

Abstract:
This study introduces a lightweight U‑Net model optimized for real‑time semantic segmentation of aerial images, targeting the efficient utilization of Commercial Off‑The‑Shelf (COTS) embedded computing platforms. We maintain the accuracy of the U‑Net on a real‑world dataset while significantly reducing the model's parameters and Multiply‑Accumulate (MAC) operations by a factor of 16. Our comprehensive analysis covers three hardware platforms (CPU, GPU, and FPGA) and five different toolchains (TVM, FINN, Vitis AI, TensorFlow GPU, and cuDNN), assessing each on metrics such as latency, power consumption, memory footprint, energy efficiency, and FPGA resource usage. The results highlight the trade‑offs between these platforms and toolchains, with a particular focus on the practical deployment challenges in real‑world applications. Our findings demonstrate that while the FPGA with Vitis AI emerges as the superior choice due to its performance, energy efficiency, and maturity, it requires specialized hardware knowledge, emphasizing the need for a balanced approach in selecting embedded computing solutions for semantic segmentation tasks.

Abstract:
The open vocabulary capability of 3D models is increasingly valued, as traditional methods with models trained with fixed categories fail to recognize unseen objects in complex dynamic 3D scenes. In this paper, we propose a simple yet effective approach, SAS, to integrate the open vocabulary capability of multiple 2D models and migrate it to the 3D domain. Specifically, we first propose Model Alignment via Text to map different 2D models into the same embedding space using text as a bridge. Then, we propose Annotation‑Free Model Capability Construction to explicitly quantify the 2D model's capability of recognizing different categories using diffusion models. Following this, point cloud features from different 2D models are fused under the guidance of the constructed model capabilities. Finally, the integrated 2D open vocabulary capability is transferred to the 3D domain through feature distillation. SAS outperforms previous methods by a large margin across multiple datasets, including ScanNet v2, Matterport3D, and nuScenes, while its generalizability is further validated on downstream tasks, e.g., Gaussian segmentation and instance segmentation.

Abstract:
Semantic segmentation is essential for analyzing high‑definition remote sensing images (HRSIs) because it allows the precise classification of objects and regions at the pixel level. However, remote sensing data present challenges owing to geographical location, weather, and environmental variations, making it difficult for semantic segmentation models to generalize across diverse scenarios. Existing methods are often limited to specific data domains and require expert annotators and specialized equipment for semantic labeling. In this study, we propose a novel unsupervised domain adaptation technique for remote sensing semantic segmentation by utilizing geographical coordinates that are readily accessible in remote sensing setups as metadata in a dataset. To bridge the domain gap, we propose a novel approach that considers the combination of an image's location encoding trait and the spherical nature of Earth's surface. Our proposed SegDesicNet module regresses the GRID positional encoding of the geo coordinates projected over the unit sphere to obtain the domain loss. Our experimental results demonstrate that the proposed SegDesicNet outperforms state‑of‑the‑art domain adaptation methods in remote sensing image segmentation, achieving an improvement of approximately 6% in the mean intersection over union (MIoU) with an approximately 27% drop in parameter count on benchmarked subsets of the publicly available FLAIR #1 dataset. We also benchmarked our method's performance on the custom split of the ISPRS Potsdam dataset. Our algorithm seeks to reduce the modeling disparity between artificial neural networks and human comprehension of the physical world, making the technology more human centric and scalable.
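For readers unfamiliar with the mean intersection over union metric reported in this and many neighboring abstracts, the following sketch computes it from flat lists of predicted and ground-truth class labels. Skipping classes that appear in neither prediction nor ground truth is one common convention and is an assumption here; individual benchmarks may handle absent classes differently.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection over union over classes present in pred or target.

    pred, target: equal-length flat lists of integer class labels.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both (assumed convention)
            ious.append(inter / union)
    if not ious:
        raise ValueError("no class is present in pred or target")
    return sum(ious) / len(ious)
```

For `pred = [0, 0, 1, 1]` and `target = [0, 1, 1, 1]`, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.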

Abstract:
Low‑level texture feature/knowledge is also of vital importance for characterizing the local structural pattern and global statistical properties, such as boundary, smoothness, regularity, and color contrast, which may not be well addressed by high‑level deep features. In this paper, we aim to re‑emphasize the low‑level texture information in deep networks for semantic segmentation and related knowledge distillation tasks. To this end, we take full advantage of both structural and statistical texture knowledge and propose a novel Structural and Statistical Texture Knowledge Distillation (SSTKD) framework for semantic segmentation. Specifically, Contourlet Decomposition Module (CDM) is introduced to decompose the low‑level features with iterative Laplacian pyramid and directional filter bank to mine the structural texture knowledge, and Texture Intensity Equalization Module (TIEM) is designed to extract and enhance the statistical texture knowledge with the corresponding Quantization Congruence Loss (QDL). Moreover, we propose the Co‑occurrence TIEM (C‑TIEM) and generic segmentation frameworks, namely STLNet++ and U‑SSNet, to enable existing segmentation networks to harvest the structural and statistical texture information more effectively. Extensive experimental results on three segmentation tasks demonstrate the effectiveness of the proposed methods and their state‑of‑the‑art performance on seven popular benchmark datasets, respectively.

Abstract:
Motion serves as a powerful cue for scene perception and understanding by separating independently moving surfaces and organizing the physical world into distinct entities. We introduce SIRE, a self‑supervised method for motion discovery of objects and dynamic scene reconstruction from casual scenes by learning intrinsic rigidity embeddings from videos. Our method trains an image encoder to estimate scene rigidity and geometry, supervised by a simple 4D reconstruction loss: a least‑squares solver uses the estimated geometry and rigidity to lift 2D point track trajectories into SE(3) tracks, which are simply re‑projected back to 2D and compared against the original 2D trajectories for supervision. Crucially, our framework is fully end‑to‑end differentiable and can be optimized either on video datasets to learn generalizable image priors, or even on a single video to capture scene‑specific structure, highlighting strong data efficiency. We demonstrate the effectiveness of our rigidity embeddings and geometry across multiple settings, including downstream object segmentation, SE(3) rigid motion estimation, and self‑supervised depth estimation. Our findings suggest that SIRE can learn strong geometry and motion rigidity priors from video data, with minimal supervision.

Abstract:
Pre‑trained encoders are widely employed in dense prediction tasks for their capability to effectively extract visual features from images. The decoder subsequently processes these features to generate pixel‑level predictions. However, due to structural differences and variations in input data, only encoders benefit from pre‑learned representations from vision benchmarks such as image classification and self‑supervised learning, while decoders are typically trained from scratch. In this paper, we introduce ×Net, which facilitates a "pre‑trained encoder × pre‑trained decoder" collaboration through three innovative designs. ×Net enables the direct utilization of pre‑trained models within the decoder, integrating pre‑learned representations into the decoding process to enhance performance in dense prediction tasks. By simply coupling the pre‑trained encoder and pre‑trained decoder, ×Net distinguishes itself as a highly promising approach. Remarkably, it achieves this without relying on decoding‑specific structures or task‑specific algorithms. Despite its streamlined design, ×Net outperforms advanced methods in tasks such as monocular depth estimation and semantic segmentation, achieving state‑of‑the‑art performance particularly in monocular depth estimation.

Abstract:
Multimodal Large Language Models (MLLMs) demonstrate robust zero‑shot capabilities across diverse vision‑language tasks after training on mega‑scale datasets. However, dense prediction tasks, such as semantic segmentation and keypoint detection, pose significant challenges for MLLMs when represented solely as text outputs. Simultaneously, current MLLMs utilizing latent embeddings for visual task decoding generally demonstrate limited adaptability to both multi‑task learning and multi‑granularity scenarios. In this work, we present REF‑VLM, an end‑to‑end framework for unified training of various visual decoding tasks. To address complex visual decoding scenarios, we introduce the Triplet‑Based Referring Paradigm (TRP), which explicitly decouples three critical dimensions in visual decoding tasks through a triplet structure: concepts, decoding types, and targets. TRP employs symbolic delimiters to enforce structured representation learning, enhancing the parsability and interpretability of model outputs. Additionally, we construct the Visual‑Task Instruction Following Dataset (VT‑Instruct), a large‑scale multi‑task dataset containing over 100 million multimodal dialogue samples across 25 task types. Beyond text inputs and outputs, VT‑Instruct incorporates various visual prompts such as point, box, scribble, and mask, and generates outputs composed of text and visual units like box, keypoint, depth, and mask. The combination of different visual prompts and visual units generates a wide variety of task types, expanding the applicability of REF‑VLM significantly. Both qualitative and quantitative experiments demonstrate that REF‑VLM outperforms other MLLMs across a variety of standard benchmarks. The code, dataset, and demo will be publicly available.

Abstract:
Despite the widespread adoption of vision sensors in edge applications, such as surveillance, the transmission of video data consumes substantial spectrum resources. Semantic communication (SC) offers a solution by extracting and compressing information at the semantic level, preserving the accuracy and relevance of transmitted data while significantly reducing the volume of transmitted information. However, traditional SC methods face inefficiencies due to the repeated transmission of static frames in edge videos, exacerbated by the absence of sensing capabilities, which results in spectrum inefficiency. To address this challenge, we propose an SC with computer vision sensing (SCCVS) framework for edge video transmission. The framework first introduces a compression ratio (CR) adaptive SC (CRSC) model, capable of adjusting CR based on whether the frames are static or dynamic, effectively conserving spectrum resources. Additionally, we implement an object detection and semantic segmentation models‑enabled sensing (OSMS) scheme, which intelligently senses the changes in the scene and assesses the significance of each frame through in‑context analysis. Hence, the OSMS scheme provides CR prompts to the CRSC model based on real‑time sensing results. Moreover, both CRSC and OSMS are designed as lightweight models, ensuring compatibility with resource‑constrained sensors commonly used in practical edge applications. Experimental simulations validate the effectiveness of the proposed SCCVS framework, demonstrating its ability to enhance transmission efficiency without sacrificing critical semantic information.

Abstract:
Moving object segmentation (MOS) on LiDAR point clouds is crucial for autonomous systems like self‑driving vehicles. Previous supervised approaches rely heavily on costly manual annotations, while LiDAR sequences naturally capture temporal motion cues that can be leveraged for self‑supervised learning. In this paper, we propose Temporal Overlapping Prediction (TOP), a self‑supervised pre‑training method that alleviates the labeling burden for MOS. TOP explores the temporal overlapping points that are commonly observed by current and adjacent scans, and learns spatiotemporal representations by predicting the occupancy states of temporal overlapping points. Moreover, we utilize current occupancy reconstruction as an auxiliary pre‑training objective, which enhances the current structural awareness of the model. We conduct extensive experiments and observe that the conventional metric Intersection‑over‑Union (IoU) shows strong bias to objects with more scanned points, which might neglect small or distant objects. To compensate for this bias, we introduce an additional metric called mIoU_obj to evaluate object‑level performance. Experiments on nuScenes and SemanticKITTI show that TOP outperforms both the supervised training‑from‑scratch baseline and other self‑supervised pre‑training baselines by up to 28.77% relative improvement, demonstrating strong transferability across LiDAR setups and generalization to other tasks. Code and pre‑trained models will be publicly available upon publication.

Abstract:
Segment Anything Model 2 (SAM2) has emerged as a strong base model in various pinhole imaging segmentation tasks. However, when applying it to the $360^\circ$ domain, the significant field‑of‑view (FoV) gap between pinhole ($70^\circ \times 70^\circ$) and panoramic images ($180^\circ \times 360^\circ$) poses unique challenges. Two major concerns for this application include 1) the inevitable distortion and object deformation brought by the large FoV disparity between domains; 2) the lack of pixel‑level semantic understanding in the original SAM2. To address these issues, we propose a novel OmniSAM framework, which makes the first attempt to apply SAM2 for panoramic semantic segmentation. Specifically, to bridge the first gap, OmniSAM first divides the panorama into sequences of patches. These patches are then treated as image sequences in a manner similar to video segmentation tasks. We then leverage SAM2's memory mechanism to extract cross‑patch correspondences that embed the cross‑FoV dependencies, improving feature continuity and the prediction consistency along mask boundaries. For the second gap, OmniSAM fine‑tunes the pretrained image encoder and reuses the mask decoder for semantic prediction. An FoV‑based prototypical adaptation module with a dynamic pseudo label update mechanism is also introduced to facilitate the alignment of memory and backbone features, thereby improving model generalization ability across different sizes of source models. Extensive experimental results demonstrate that OmniSAM outperforms the state‑of‑the‑art methods by large margins, e.g., 79.06% (+10.22%) on SPin8‑to‑SPan8, 62.46% (+6.58%) on CS13‑to‑DP13.
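The step of dividing a panorama into a sequence of patches can be pictured as a horizontal sliding window, as in the sketch below. The window width, stride, and whether patches overlap are not given in the abstract; `patch_width` and `stride` are hypothetical parameters, and the final window is shifted left so that the right edge of the image is always covered.

```python
def split_panorama(width, patch_width, stride):
    """Column ranges (start, end) of patches swept left to right.

    Returns half-open ranges that together cover columns 0..width-1.
    """
    if not 0 < patch_width <= width or stride <= 0:
        raise ValueError("need 0 < patch_width <= width and stride > 0")
    starts = list(range(0, width - patch_width + 1, stride))
    if starts[-1] + patch_width < width:
        starts.append(width - patch_width)  # extra window to reach the right edge
    return [(s, s + patch_width) for s in starts]
```

The resulting ordered list can then be fed to a video-style model one patch at a time, so that memory from earlier patches informs later ones.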

Abstract:
This paper demonstrates a surprising result for segmentation with image‑level targets: extending binary class tags to approximate relative object‑size distributions allows off‑the‑shelf architectures to solve the segmentation problem. A straightforward zero‑avoiding KL‑divergence loss for average predictions produces segmentation accuracy comparable to the standard pixel‑precise supervision with full ground truth masks. In contrast, current results based on class tags typically require complex non‑reproducible architectural modifications and specialized multi‑stage training procedures. Our ideas are validated on PASCAL VOC using our new human annotations of approximate object sizes. We also show the results on COCO and medical data using synthetically corrupted size targets. All standard networks demonstrate robustness to the size targets' errors. For some classes, the validation accuracy is significantly better than the pixel‑level supervision; the latter is not robust to errors in the masks. Our work provides new ideas and insights on image‑level supervision in segmentation and may encourage other simple general solutions to the problem.
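To make the size-target loss concrete, here is a minimal sketch that averages per-pixel class probabilities over an image and compares the result with a relative object-size target. It assumes that "zero-avoiding" refers to the forward direction KL(target || average prediction), which penalizes the model heavily when it assigns near-zero area to a class the target says is present; the function names and the clamping constant `eps` are hypothetical.

```python
import math


def average_prediction(pixel_probs):
    """Mean class distribution over all pixels of one image."""
    n = len(pixel_probs)
    k = len(pixel_probs[0])
    return [sum(p[c] for p in pixel_probs) / n for c in range(k)]


def size_kl_loss(size_target, pixel_probs, eps=1e-8):
    """Forward KL divergence from the size target to the average prediction.

    size_target: relative object sizes per class, summing to 1.
    Terms with a zero target contribute nothing, by the usual convention.
    """
    avg = average_prediction(pixel_probs)
    return sum(
        t * math.log(t / max(a, eps))
        for t, a in zip(size_target, avg)
        if t > 0
    )
```

If one of four pixels is predicted as class 0 and three as class 1, the average prediction is [0.25, 0.75], so a target of [0.25, 0.75] gives zero loss regardless of which pixels carry which label; the loss constrains sizes, not locations.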

Abstract:
Understanding 3D object shapes necessitates shape representation by object parts abstracted from results of instance and semantic segmentation. Promising shape representations enable computers to interpret a shape with meaningful parts and identify their repeatability. However, supervised shape representations depend on costly annotation efforts, while current unsupervised methods work under strong semantic priors and involve multi‑stage training, thereby limiting their generalization and deployment in shape reasoning and understanding. Driven by the tendency of high‑dimensional semantically similar features to lie in or near low‑dimensional subspaces, we introduce a one‑stage, fully unsupervised framework towards semantic‑aware shape representation. This framework produces joint instance segmentation, semantic segmentation, and shape abstraction through sparse representation and feature alignment of object parts in a high‑dimensional space. For sparse representation, we devise a sparse latent membership pursuit method that models each object part feature as a sparse convex combination of point features at either the semantic or instance level, promoting part features in the same subspace to exhibit similar semantics. For feature alignment, we customize an attention‑based strategy in the feature space to align instance‑ and semantic‑level object part features and reconstruct the input shape using both of them, ensuring geometric reusability and semantic consistency of object parts. To firm up semantic disambiguation, we construct cascade unfrozen learning on geometric parameters of object parts.

Abstract:
Dynamic scene rendering opens new avenues in autonomous driving by enabling closed‑loop simulations with photorealistic data, which is crucial for validating end‑to‑end algorithms. However, the complex and highly dynamic nature of traffic environments presents significant challenges in accurately rendering these scenes. In this paper, we introduce a novel 4D Gaussian Splatting (4DGS) approach, which incorporates context and temporal deformation awareness to improve dynamic scene rendering. Specifically, we employ a 2D semantic segmentation foundation model to self‑supervise the 4D semantic features of Gaussians, ensuring meaningful contextual embedding. Simultaneously, we track the temporal deformation of each Gaussian across adjacent frames. By aggregating and encoding both semantic and temporal deformation features, each Gaussian is equipped with cues for potential deformation compensation within 3D space, facilitating a more precise representation of dynamic scenes. Experimental results show that our method improves 4DGS's ability to capture fine details in dynamic scene rendering for autonomous driving and outperforms other self‑supervised methods in 4D reconstruction and novel view synthesis. Furthermore, CoDa‑4DGS deforms semantic features with each Gaussian, enabling broader applications.

Abstract:
Research has focused on Multi‑Modal Semantic Segmentation (MMSS), where pixel‑wise predictions are derived from multiple visual modalities captured by diverse sensors. Recently, the large vision model, Segment Anything Model 2 (SAM2), has shown strong zero‑shot segmentation performance on both images and videos. When extending SAM2 to MMSS, two issues arise: 1. How can SAM2 be adapted to multi‑modal data? 2. How can SAM2 better understand semantics? Inspired by cross‑frame correlation in videos, we propose to treat multi‑modal data as a sequence of frames representing the same scene. Our key idea is to "memorize" the modality‑agnostic information and "memorize" the semantics related to the targeted scene. To achieve this, we apply SAM2's memory mechanisms across multi‑modal data to capture modality‑agnostic features. Meanwhile, to memorize the semantic knowledge, we propose a training‑only Semantic Prototype Memory Module (SPMM) to store category‑level prototypes across training for facilitating SAM2's transition from instance to semantic segmentation. A prototypical adaptation loss is imposed between global and local prototypes iteratively to align and refine SAM2's semantic understanding. Extensive experimental results demonstrate that our proposed MemorySAM outperforms SoTA methods by large margins on both synthetic and real‑world benchmarks (65.38% on DELIVER, 52.88% on MCubeS). Source code will be made publicly available.

Abstract:
Online Knowledge Distillation (OKD) methods streamline the distillation training process into a single stage, eliminating the need for knowledge transfer from a pretrained teacher network to a more compact student network. This paper presents an innovative approach to leverage intermediate spatial representations. Our analysis of the intermediate features from both teacher and student models reveals two pivotal insights: (1) the features shared by students and teachers are predominantly focused on foreground objects; (2) teacher models emphasize foreground objects more than students. Building on these findings, we propose Asymmetric Decision‑Making (ADM) to enhance feature consensus learning for student models while continuously promoting feature diversity in teacher models. Specifically, Consensus Learning for student models prioritizes spatial features with high consensus relative to teacher models. Conversely, Divergence Learning for teacher models highlights spatial features with lower similarity compared to student models, indicating superior performance by teacher models in these regions. Consequently, ADM helps the student models catch up with the feature learning process of the teacher models. Extensive experiments demonstrate that ADM consistently surpasses existing OKD methods across various online knowledge distillation settings and also achieves superior results when applied to offline knowledge distillation, semantic segmentation, and diffusion distillation tasks.

Abstract:
Semantic segmentation is a core task in computer vision with applications in biomedical imaging, remote sensing, and autonomous driving. While standard loss functions such as cross‑entropy and Dice loss perform well in general cases, they often struggle with fine structures, particularly in tasks involving thin structures or closely packed objects. Various weight map‑based loss functions have been proposed to address this issue by assigning higher loss weights to pixels prone to misclassification. However, these methods typically rely on precomputed or runtime‑generated weight maps based on distance transforms, which impose significant computational costs and fail to adapt to evolving network predictions. In this paper, we propose a novel steerable pyramid‑based weighted (SPW) loss function that efficiently generates adaptive weight maps. Unlike traditional boundary‑aware losses that depend on static or iteratively updated distance maps, our method leverages steerable pyramids to dynamically emphasize regions across multiple frequency bands (capturing features at different scales) while maintaining computational efficiency. Additionally, by incorporating network predictions into the weight computation, our approach enables adaptive refinement during training. We evaluate our method on the SNEMI3D, GlaS, and DRIVE datasets, benchmarking it against 11 state‑of‑the‑art loss functions. Our results demonstrate that the proposed SPW loss function achieves superior pixel precision and segmentation accuracy with minimal computational overhead. This work provides an effective and efficient solution for improving semantic segmentation, particularly for applications requiring multiscale feature representation. The code is available at https://anonymous.4open.science/r/SPW‑0884

Abstract:
3D neuroimages provide a comprehensive view of brain structure and function, aiding in precise localization and functional connectivity analysis. Segmentation of white matter (WM) tracts using 3D neuroimages is vital for understanding the brain's structural connectivity in both healthy and diseased states. One‑shot Class Incremental Semantic Segmentation (OCIS) refers to effectively segmenting new (novel) classes using only a single sample while retaining knowledge of old (base) classes without forgetting. Voxel‑contrastive OCIS methods adjust the feature space to alleviate the feature overlap problem between the base and novel classes. However, since WM tract segmentation is a multi‑label segmentation task, existing single‑label voxel contrastive‑based methods may cause inherent contradictions. To address this, we propose a new multi‑label voxel contrast framework called MultiCo3D for one‑shot class incremental tract segmentation. Our method utilizes uncertainty distillation to preserve base tract segmentation knowledge, adjusts the feature space with multi‑label voxel contrast to alleviate feature overlap when learning novel tracts, and dynamically weights multiple losses to balance the overall loss. We compare our method against several state‑of‑the‑art (SOTA) approaches. The experimental results show that our method significantly enhances one‑shot class incremental tract segmentation accuracy across five different experimental setups on HCP and Preto datasets.

Abstract:
Dense visual prediction tasks, such as detection and segmentation, are crucial for time‑critical applications (e.g., autonomous driving and video surveillance). While deep models achieve strong performance, their efficiency remains a challenge. Knowledge distillation (KD) is an effective model compression technique, but existing feature‑based KD methods rely on static, teacher‑driven feature selection, failing to adapt to the student's evolving learning state or leverage dynamic student‑teacher interactions. To address these limitations, we propose Adaptive student‑teacher Cooperative Attention Masking for Knowledge Distillation (ACAM‑KD), which introduces two key components: (1) Student‑Teacher Cross‑Attention Feature Fusion (STCA‑FF), which adaptively integrates features from both models for a more interactive distillation process, and (2) Adaptive Spatial‑Channel Masking (ASCM), which dynamically generates importance masks to enhance both spatial and channel‑wise feature selection. Unlike conventional KD methods, ACAM‑KD adapts to the student's evolving needs throughout the entire distillation process. Extensive experiments on multiple benchmarks validate its effectiveness. For instance, on COCO2017, ACAM‑KD improves object detection performance by up to 1.4 mAP over the state‑of‑the‑art when distilling a ResNet‑50 student from a ResNet‑101 teacher. For semantic segmentation on Cityscapes, it boosts mIoU by 3.09 over the baseline with DeepLabV3‑MobileNetV2 as the student model.

Abstract:
Amodal instance segmentation, which aims to detect and segment both visible and invisible parts of objects in images, plays a crucial role in various applications including autonomous driving, robotic manipulation, and scene understanding. While existing methods require training both front‑end detectors and mask decoders jointly, this approach lacks flexibility and fails to leverage the strengths of pre‑existing modal detectors. To address this limitation, we propose SAMEO, a novel framework that adapts the Segment Anything Model (SAM) as a versatile mask decoder capable of interfacing with various front‑end detectors to enable mask prediction even for partially occluded objects. Acknowledging the constraints of limited amodal segmentation datasets, we introduce Amodal‑LVIS, a large‑scale synthetic dataset comprising 300K images derived from the modal LVIS and LVVIS datasets. This dataset significantly expands the training data available for amodal segmentation research. Our experimental results demonstrate that our approach, when trained on the newly extended dataset, including Amodal‑LVIS, achieves remarkable zero‑shot performance on both COCOA‑cls and D2SA benchmarks, highlighting its potential for generalization to unseen scenarios.

Abstract:
Minimally invasive surgery (MIS) requires high‑fidelity, real‑time visual feedback of dynamic and low‑texture surgical scenes. To address these requirements, we introduce FeatureEndo‑4DGS (FE‑4DGS), the first real‑time pipeline leveraging feature‑distilled 4D Gaussian Splatting for simultaneous reconstruction and semantic segmentation of deformable surgical environments. Unlike prior feature‑distilled methods restricted to static scenes, and existing 4D approaches that lack semantic integration, FE‑4DGS seamlessly leverages pre‑trained 2D semantic embeddings to produce a unified 4D representation, where semantics also deform with tissue motion. This unified approach enables the generation of real‑time RGB and semantic outputs through a single, parallelized rasterization process. Despite the additional complexity from feature distillation, FE‑4DGS sustains real‑time rendering (61 FPS) with a compact footprint, achieves state‑of‑the‑art rendering fidelity on EndoNeRF (39.1 PSNR) and SCARED (27.3 PSNR), and delivers competitive EndoVis18 segmentation, matching or exceeding strong 2D baselines for binary segmentation tasks (0.93 DSC) and remaining competitive for multi‑label segmentation (0.77 DSC).

Abstract:
Diffusion probabilistic models are traditionally used to generate colors at fixed pixel positions in 2D images. Building on this, we extend diffusion models to point cloud semantic segmentation, where point positions also remain fixed, and the diffusion model generates point labels instead of colors. To accelerate the denoising process in reverse diffusion, we introduce a noisy label embedding mechanism. This approach integrates semantic information into the noisy label, providing an initial semantic reference that improves the reverse diffusion efficiency. Additionally, we propose a point frequency transformer that enhances the adjustment of high‑level context in point clouds. To reduce computational complexity, we introduce the position condition into MLP and propose denoising PointNet to process the high‑resolution point cloud without sacrificing geometric details. Finally, we integrate the proposed noisy label embedding, point frequency transformer and denoising PointNet in our proposed dual conditional diffusion model‑based network (PointDiffuse) to perform large‑scale point cloud semantic segmentation. Extensive experiments on five benchmarks demonstrate the superiority of PointDiffuse, achieving the state‑of‑the‑art mIoU of 74.2% on S3DIS Area 5, 81.2% on S3DIS 6‑fold and 64.8% on SWAN dataset.

Abstract:
Modern orchards are planted in structured rows with distinct panel divisions to improve management. Accurate and efficient joint segmentation of point cloud from Panel to Tree and Branch (P2TB) is essential for robotic operations. However, most current segmentation methods focus on single instance segmentation and depend on a sequence of deep networks to perform joint tasks. This strategy hinders the use of hierarchical information embedded in the data, leading to both error accumulation and increased costs for annotation and computation, which limits its scalability for real‑world applications. In this study, we proposed a novel approach that incorporated a Real2Sim L‑TreeGen for training data generation and a joint model (J‑P2TB) designed for the P2TB task. The J‑P2TB model, trained on the generated simulation dataset, was used for joint segmentation of real‑world panel point clouds via zero‑shot learning. Compared to representative methods, our model outperformed them in most segmentation metrics while using 40% fewer learnable parameters. This Sim2Real result highlighted the efficacy of L‑TreeGen in model training and the performance of J‑P2TB for joint segmentation, demonstrating its strong accuracy, efficiency, and generalizability for real‑world applications. These improvements would not only greatly benefit the development of robots for automated orchard operations but also advance digital twin technology.

Abstract:
Cutting‑edge robot learning techniques, including foundation models and imitation learning from humans, all place huge demands on large‑scale, high‑quality datasets, which constitute one of the bottlenecks in the field of general intelligent robots. This paper presents the Kaiwu multimodal dataset to address the lack of real‑world synchronized multimodal data in sophisticated assembly scenarios, especially data with dynamics information and fine‑grained labelling. The dataset first provides an integrated human, environment and robot data collection framework with 20 subjects and 30 interaction objects, resulting in a total of 11,664 instances of integrated actions. For each demonstration, hand motions, operation pressures, sounds of the assembly process, multi‑view videos, high‑precision motion capture information, eye gaze with first‑person videos, and electromyography signals are all recorded. Fine‑grained multi‑level annotation based on absolute timestamps and semantic segmentation labelling are performed. The Kaiwu dataset aims to facilitate research on robot learning, dexterous manipulation, human intention investigation and human‑robot collaboration.

Abstract:
The reliance on large labeled datasets presents a significant challenge in medical image segmentation. Few‑shot learning offers a potential solution, but existing methods often still require substantial training data. This paper proposes a novel approach that leverages the Segment Anything Model 2 (SAM2), a vision foundation model with strong video segmentation capabilities. We conceptualize 3D medical image volumes as video sequences, departing from the traditional slice‑by‑slice paradigm. Our core innovation is a support‑query matching strategy: we perform extensive data augmentation on a single labeled support image and, for each frame in the query volume, algorithmically select the most analogous augmented support image. This selected image, along with its corresponding mask, is used as a mask prompt, driving SAM2's video segmentation. This approach entirely avoids model retraining or parameter updates. We demonstrate state‑of‑the‑art performance on benchmark few‑shot medical image segmentation datasets, achieving significant improvements in accuracy and annotation efficiency. This plug‑and‑play method offers a powerful and generalizable solution for 3D medical image segmentation.
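The support‑query matching step described above can be illustrated with a minimal sketch. It assumes each image has already been reduced to a feature vector and that cosine similarity is the matching criterion; the abstract does not state the similarity measure, so `cosine` and `select_support` are hypothetical names used only for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def select_support(query_feats, support_feats):
    """For each query frame, return the index of the most similar augmented support image.

    Assumption: similarity is measured on precomputed feature vectors.
    """
    return [
        max(range(len(support_feats)), key=lambda i: cosine(q, support_feats[i]))
        for q in query_feats
    ]
```

In such a setup, the selected support image and its mask would then be passed to the video model as the mask prompt for that frame.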

Abstract:
For scene understanding in unstructured environments, an accurate and uncertainty‑aware metric‑semantic mapping is required to enable informed action selection by autonomous systems. Existing mapping methods often suffer from overconfident semantic predictions, and sparse and noisy depth sensing, leading to inconsistent map representations. In this paper, we therefore introduce EvidMTL, a multi‑task learning framework that uses evidential heads for depth estimation and semantic segmentation, enabling uncertainty‑aware inference from monocular RGB images. To enable uncertainty‑calibrated evidential multi‑task learning, we propose a novel evidential depth loss function that jointly optimizes the belief strength of the depth prediction in conjunction with evidential segmentation loss. Building on this, we present EvidKimera, an uncertainty‑aware semantic surface mapping framework, which uses evidential depth and semantics prediction for improved 3D metric‑semantic consistency. We train and evaluate EvidMTL on the NYUDepthV2 and assess its zero‑shot performance on ScanNetV2, demonstrating superior uncertainty estimation compared to conventional approaches while maintaining comparable depth estimation and semantic segmentation. In zero‑shot mapping tests on ScanNetV2, EvidKimera outperforms Kimera in semantic surface mapping accuracy and consistency, highlighting the benefits of uncertainty‑aware mapping and underscoring its potential for real‑world robotic applications.
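As a rough illustration of how an evidential segmentation head yields an uncertainty estimate, the sketch below uses the standard Dirichlet (subjective‑logic) formulation common in evidential deep learning. It is not taken from the paper: the function name `evidential_pixel` is hypothetical, and the exact parameterization used by EvidMTL may differ.

```python
def evidential_pixel(evidence):
    """Convert non-negative per-class evidence for one pixel into
    expected class probabilities and a scalar uncertainty.

    Assumption: Dirichlet parameters are alpha_k = evidence_k + 1.
    """
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)                  # Dirichlet strength S
    probs = [a / strength for a in alpha]  # expected class probabilities
    uncertainty = num_classes / strength   # equals 1.0 when there is no evidence
    return probs, uncertainty
```

With zero evidence the prediction is uniform and the uncertainty is maximal; as evidence accumulates for a class, the uncertainty shrinks toward zero.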

Abstract:
Point clouds from Terrestrial Laser Scanning (TLS) are an increasingly popular source of data for studying plant structure and function but typically require extensive manual processing to extract ecologically important information. One key task is the accurate semantic segmentation of different plant material within point clouds, particularly wood and leaves, which is required to understand plant productivity, architecture and physiology. Existing automated semantic segmentation methods are primarily developed for single ecosystem types, and whilst they show good accuracy for biomass assessment from the trunk and large branches, they often perform less well within the crown. In this study, we demonstrate a new framework that uses a deep learning architecture newly developed from PointNet and pointNEXT for processing 3D point clouds to provide a reliable semantic segmentation of wood and leaf in TLS point clouds from the tree base to branch tips, trained on data from diverse mature European forests. Our model uses meticulously labelled data combined with voxel‑based sampling, neighbourhood rescaling, and a novel gated reflectance integration module embedded throughout the feature extraction layers. We evaluate its performance across open datasets from boreal, temperate, Mediterranean and tropical regions, encompassing diverse ecosystem types and sensor characteristics. Our results show consistent outperformance against the most widely used PointNet‑based approach for leaf/wood segmentation on our high‑density TLS dataset collected across diverse mixed forest plots across all major biomes in Europe. We also find consistently strong performance when tested on other open data from China, Eastern Cameroon, Germany and Finland, collected using both time‑of‑flight and phase‑shift sensors, showcasing the transferability of our model to a wide range of ecosystems and sensors.

Abstract:
Monocular visual localization plays a pivotal role in advanced driver assistance systems and autonomous driving by estimating a vehicle's ego‑motion from a single pinhole camera. Nevertheless, conventional monocular visual odometry encounters challenges in scale estimation due to the absence of depth information during projection. Previous methodologies, whether rooted in physical constraints or deep learning paradigms, contend with issues related to computational complexity and the management of dynamic objects. This study extends our prior research, presenting innovative strategies for ego‑motion estimation and the selection of ground points. Striving for a nuanced equilibrium between computational efficiency and precision, we propose a hybrid method that leverages the SegNeXt model for real‑time applications, encompassing both ego‑motion estimation and ground point selection. Our methodology incorporates dynamic object masks to eliminate unstable features and employs ground plane masks for meticulous triangulation. Furthermore, we exploit geometric constraints to delineate road regions for scale recovery. The integration of this approach with the monocular version of ORB‑SLAM3 culminates in the accurate estimation of a road model, a pivotal component in our scale recovery process. Rigorous experiments, conducted on the KITTI dataset, systematically compare our method with existing monocular visual odometry algorithms and contemporary scale recovery methodologies. The results confirm the superior effectiveness of our approach, surpassing state‑of‑the‑art visual odometry algorithms. Our source code is available at https://github.com/bFr0zNq/MVOSegScale.

Abstract:
RGB‑Thermal fusion is a potential solution for various weather and light conditions in challenging scenarios. However, many studies focus on designing complex modules to fuse different modalities. With the widespread application of large language models (LLMs), valuable information can be more effectively extracted from natural language. Therefore, we aim to leverage the advantages of large language models to design a structurally simple and highly adaptable multimodal fusion model architecture. We propose the MultimodAl Segmentation with TExt PRompts (MASTER) architecture, which integrates an LLM into the fusion of RGB‑Thermal multimodal data and allows complex query text to participate in the fusion process. Our model utilizes a dual‑path structure to extract information from different modalities of images. Additionally, we employ the LLM as the core module for multimodal fusion, enabling the model to generate learnable codebook tokens from RGB images, thermal images, and textual information. A lightweight image decoder is used to obtain semantic segmentation results. The proposed MASTER performs exceptionally well in benchmark tests across various automated driving scenarios, yielding promising results.

Abstract:
In intelligent transportation systems (ITSs), incorporating pedestrians and vehicles in‑the‑loop is crucial for developing realistic and safe traffic management solutions. However, existing work falls short of simulating complex real‑world ITS scenarios, primarily due to the lack of a digital twin implementation framework for characterizing interactions between pedestrians and vehicles at different locations in different traffic environments. In this article, we propose a surveillance video assisted federated digital twin (SV‑FDT) framework to empower ITSs with pedestrians and vehicles in‑the‑loop. Specifically, SV‑FDT builds comprehensive pedestrian‑vehicle interaction models by leveraging multi‑source traffic surveillance videos. Its architecture consists of three layers: (i) the end layer, which collects traffic surveillance videos from multiple sources; (ii) the edge layer, responsible for semantic segmentation‑based visual understanding, twin agent‑based interaction modeling, and local digital twin system (LDTS) creation in local regions; and (iii) the cloud layer, which integrates LDTSs across different regions to construct a global DT model in real time. We analyze key design requirements and challenges and present core guidelines for SV‑FDT's system implementation. A testbed evaluation demonstrates its effectiveness in optimizing traffic management. Comparisons with traditional terminal‑server frameworks highlight SV‑FDT's advantages in mirroring delays, recognition accuracy, and subjective evaluation. Finally, we identify some open challenges and discuss future research directions.

Abstract:
3D occupancy prediction has recently emerged as a new paradigm for holistic 3D scene understanding and provides valuable information for downstream planning in autonomous driving. Most existing methods, however, are computationally expensive, requiring costly attention‑based 2D‑3D transformation and 3D feature processing. In this paper, we present a novel 3D occupancy prediction approach, H3O, which features highly efficient architecture designs that incur a significantly lower computational cost compared to the current state‑of‑the‑art methods. In addition, to compensate for the ambiguity in ground‑truth 3D occupancy labels, we advocate leveraging auxiliary tasks to complement the direct 3D supervision. In particular, we integrate multi‑camera depth estimation, semantic segmentation, and surface normal estimation via differentiable volume rendering, supervised by the corresponding 2D labels, which introduces rich and heterogeneous supervision signals. We conduct extensive experiments on the Occ3D‑nuScenes and SemanticKITTI benchmarks that demonstrate the superiority of our proposed H3O.

Abstract:
Recent advancements in 3D Gaussian Splatting (3DGS) have significantly improved semantic scene understanding, enabling natural language queries to localize objects within a scene. However, existing methods primarily focus on embedding compressed CLIP features into 3D Gaussians, suffering from low object segmentation accuracy and lacking spatial reasoning capabilities. To address these limitations, we propose GaussianGraph, a novel framework that enhances 3DGS‑based scene understanding by integrating adaptive semantic clustering and scene graph generation. We introduce a "Control‑Follow" clustering strategy, which dynamically adapts to scene scale and feature distribution, avoiding feature compression and significantly improving segmentation accuracy. Additionally, we enrich scene representation by integrating object attributes and spatial relations extracted from 2D foundation models. To address inaccuracies in spatial relationships, we propose 3D correction modules that filter implausible relations through spatial consistency verification, ensuring reliable scene graph construction. Extensive experiments on three datasets demonstrate that GaussianGraph outperforms state‑of‑the‑art methods in both semantic segmentation and object grounding tasks, providing a robust solution for complex scene understanding and interaction.

Abstract:
Autonomous off‑road navigation faces challenges due to diverse, unstructured environments, requiring robust perception with both geometric and semantic understanding. However, scarce densely labeled semantic data limits generalization across domains. Simulated data helps, but introduces domain adaptation issues. We propose COARSE, a semi‑supervised domain adaptation framework for off‑road semantic segmentation, leveraging sparse, coarse in‑domain labels and densely labeled out‑of‑domain data. Using pretrained vision transformers, we bridge domain gaps with complementary pixel‑level and patch‑level decoders, enhanced by a collaborative pseudo‑labeling strategy on unlabeled data. Evaluations on RUGD and Rellis‑3D datasets show significant improvements of 9.7% and 8.4% respectively, versus only using coarse data. Tests on real‑world off‑road vehicle data in a multi‑biome setting further demonstrate COARSE's applicability.

Abstract:
Background: We evaluate SAM 2 for surgical scene understanding by examining its semantic segmentation capabilities for organs/tissues both in zero‑shot scenarios and after fine‑tuning. Methods: We utilized five public datasets to evaluate and fine‑tune SAM 2 for segmenting anatomical tissues in surgical videos/images. Fine‑tuning was applied to the image encoder and mask decoder. We limited training subsets to between 50 and 400 samples per class to better model real‑world constraints on data acquisition. The impact of dataset size on fine‑tuning performance was evaluated with the weighted mean Dice coefficient (WMDC), and the results were also compared against previously reported state‑of‑the‑art (SOTA) results. Results: SurgiSAM 2, a fine‑tuned SAM 2 model, demonstrated significant improvements in segmentation performance, achieving a 17.9% relative WMDC gain compared to the baseline SAM 2. Increasing prompt points from 1 to 10 and training data scale from 50/class to 400/class enhanced performance; the best WMDC of 0.92 on the validation subset was achieved with 10 prompt points and 400 samples per class. On the test subset, this model outperformed prior SOTA methods in 24/30 (80%) of the classes with a WMDC of 0.91 using 10‑point prompts. Notably, SurgiSAM 2 generalized effectively to unseen organ classes, achieving SOTA on 7/9 (77.8%) of them. Conclusion: SAM 2 achieves remarkable zero‑shot and fine‑tuned performance for surgical scene segmentation, surpassing prior SOTA models across several organ classes of diverse datasets. This suggests immense potential for enabling automated/semi‑automated annotation pipelines, thereby decreasing the annotation burden and facilitating several surgical applications.
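To make the evaluation metric concrete, the following minimal sketch computes a per‑class Dice coefficient and a weighted mean over classes. The abstract does not define how the weights are chosen; the sketch assumes they are per‑class sample counts, and the names `dice` and `weighted_mean_dice` are hypothetical.

```python
def dice(pred, target):
    """Dice coefficient for two binary masks given as flat lists of 0/1."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Two empty masks are treated as a perfect match.
    return 2.0 * intersection / total if total else 1.0

def weighted_mean_dice(scores, weights):
    """Weighted mean of per-class Dice scores.

    Assumption: weights are per-class sample counts.
    """
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)
```

For example, two classes with Dice scores 1.0 and 0.5 and sample counts 3 and 1 give a weighted mean of 0.875.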

Abstract:
Referring video object segmentation aims to segment and track a target object in a video using a natural language prompt. Existing methods typically fuse visual and textual features in a highly entangled manner, processing multi‑modal information together to generate per‑frame masks. However, this approach often struggles with ambiguous target identification, particularly in scenes with multiple similar objects, and fails to ensure consistent mask propagation across frames. To address these limitations, we introduce FindTrack, an efficient decoupled framework that separates target identification from mask propagation. FindTrack first adaptively selects a key frame by balancing segmentation confidence and vision‑text alignment, establishing a robust reference for the target object. This reference is then utilized by a dedicated propagation module to track and segment the object across the entire video. By decoupling these processes, FindTrack effectively reduces ambiguities in target association and enhances segmentation consistency. FindTrack significantly outperforms all existing methods on public benchmarks, demonstrating its superiority.
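The key‑frame selection idea can be sketched as a weighted combination of the two per‑frame scores. This is only an illustration under assumptions: the abstract does not say how the two terms are balanced, so the linear weighting, the `balance` parameter, and the function name `select_key_frame` are all hypothetical.

```python
def select_key_frame(seg_conf, text_align, balance=0.5):
    """Return the index of the frame with the best combined score.

    seg_conf:   per-frame segmentation confidence in [0, 1]
    text_align: per-frame vision-text alignment score in [0, 1]
    balance:    hypothetical weight on segmentation confidence
    """
    scores = [
        balance * c + (1.0 - balance) * a
        for c, a in zip(seg_conf, text_align)
    ]
    return max(range(len(scores)), key=scores.__getitem__)
```

The mask predicted on the selected frame would then serve as the reference that a separate propagation module tracks through the rest of the video.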

Abstract:
Digitalization in the construction industry has become essential, enabling centralized, easy access to all relevant information of a building. Automated systems can facilitate the timely and resource‑efficient documentation of changes, which is crucial for key processes such as progress tracking and quality control. This paper presents a method for image‑based automated drywall analysis enabling construction progress and quality assessment through on‑site camera systems. Our proposed solution integrates a deep learning‑based instance segmentation model to detect and classify various drywall elements with an analysis module to cluster individual wall segments, estimate camera perspective distortions, and apply the corresponding corrections. This system extracts valuable information from images, enabling more accurate progress tracking and quality assessment on construction sites. Our main contributions include a fully automated pipeline for drywall analysis, improving instance segmentation accuracy through architecture modifications and targeted data augmentation, and a novel algorithm to extract important information from the segmentation results. Our modified model, enhanced with data augmentation, achieves significantly higher accuracy compared to other architectures, offering more detailed and precise information than existing approaches. Combined with the proposed drywall analysis steps, it enables the reliable automation of construction progress and quality assessment.

Abstract:
LiDAR semantic segmentation models are typically trained from random initialization as universal pre‑training is hindered by the lack of large, diverse datasets. Moreover, most point cloud segmentation architectures incorporate custom network layers, limiting the transferability of advances from vision‑based architectures. Inspired by recent advances in universal foundation models, we propose BALViT, a novel approach that leverages frozen vision models as amodal feature encoders for learning strong LiDAR encoders. Specifically, BALViT incorporates both range‑view and bird's‑eye‑view LiDAR encoding mechanisms, which we combine through a novel 2D‑3D adapter. While the range‑view features are processed through a frozen image backbone, our bird's‑eye‑view branch enhances them through multiple cross‑attention interactions. Thereby, we continuously improve the vision network with domain‑dependent knowledge, resulting in a strong label‑efficient LiDAR encoding mechanism. Extensive evaluations of BALViT on the SemanticKITTI and nuScenes benchmarks demonstrate that it outperforms state‑of‑the‑art methods on small data regimes. We make the code and models publicly available at: http://balvit.cs.uni‑freiburg.de.

Abstract:
Accurate motion understanding of the dynamic objects within the scene in bird's‑eye‑view (BEV) is critical to ensure a reliable obstacle avoidance system and smooth path planning for autonomous vehicles. However, this task has received relatively limited exploration compared to object detection and segmentation, with only a few recent vision‑based approaches presenting preliminary findings that significantly deteriorate in low‑light, nighttime, and adverse weather conditions such as rain. Conversely, LiDAR and radar sensors remain almost unaffected in these scenarios, and radar provides key velocity information of the objects. Therefore, we introduce BEVMOSNet, to our knowledge the first end‑to‑end multimodal fusion approach leveraging cameras, LiDAR, and radar to precisely predict the moving objects in BEV. In addition, we perform a deeper analysis to find the optimal strategy for deformable cross‑attention‑guided sensor fusion for cross‑sensor knowledge sharing in BEV. Evaluating BEVMOSNet on the nuScenes dataset, we show an overall improvement in IoU score of 36.59% compared to the vision‑based unimodal baseline BEV‑MoSeg (Sigatapu et al., 2023), and 2.35% compared to the multimodal SimpleBEV (Harley et al., 2022), extended for the motion segmentation task, establishing this method as the state‑of‑the‑art in BEV motion segmentation.

Abstract:
Generating large‑scale sensing datasets through photo‑realistic simulation is an important aspect of many robotics applications such as autonomous driving. In this paper, we consider the problem of synchronous data collection from the open‑source CARLA simulator using multiple sensors attached to a vehicle based on user‑defined criteria. We propose a novel, one‑step framework that we refer to as Car‑STAGE, based on the CARLA simulator, to generate data using a graphical user interface (GUI) that defines the configuration parameters for data collection, without any further user intervention. This framework can utilize user‑defined configuration parameters such as choice of maps, number and configurations of sensors, and environmental and lighting conditions to run the simulation in the background, collecting high‑dimensional sensor data from diverse sensors such as RGB Camera, LiDAR, Radar, Depth Camera, IMU Sensor, GNSS Sensor, Semantic Segmentation Camera, Instance Segmentation Camera, and Optical Flow Camera along with the ground truths of the individual actors, and storing the sensor data as well as ground‑truth labels in a local or cloud‑based database. The framework uses multiple threads: a main thread runs the server, a worker thread handles the queue and frame numbers, and the remaining threads process the sensor data. We obtain a further speed‑up over the native implementation by memory‑mapping the raw binary data to disk and then converting the data into known formats at the end of data collection. We show that using these techniques, we gain a significant speed‑up over frames, under an increasing set of sensors and over the number of spawned objects.

Abstract:
The Segment Anything Model (SAM) has revolutionized image and video segmentation with its powerful zero‑shot capabilities. However, its massive parameter scale and high computational demands hinder efficient deployment on resource‑constrained edge devices. While Post‑Training Quantization (PTQ) offers a practical solution, existing methods still fail to handle four critical quantization challenges: (1) ill‑conditioned weights; (2) skewed and long‑tailed post‑GELU activations; (3) pronounced inter‑channel variance in linear projections; and (4) exponentially scaled and heterogeneous attention scores. To mitigate these bottlenecks, we propose AHCQ‑SAM, an accurate and hardware‑compatible PTQ framework featuring four synergistic components: (1) Activation‑aware Condition Number Reduction (ACNR), which regularizes weight matrices via a proximal point algorithm to suppress ill‑conditioning; (2) Hybrid Log‑Uniform Quantization (HLUQ), which combines power‑of‑two and uniform quantizers to capture skewed post‑GELU activations; (3) Channel‑Aware Grouping (CAG), which clusters channels with homogeneous statistics to achieve high accuracy with minimal hardware overhead; and (4) Logarithmic Nonlinear Quantization (LNQ), which utilizes logarithmic transformations to adaptively adjust quantization resolution for exponential and heterogeneous attention scores. Experimental results demonstrate that AHCQ‑SAM outperforms current methods on SAM. Compared with the SOTA method, it achieves a 15.2% improvement in mAP for 4‑bit SAM‑B with Faster R‑CNN on the COCO dataset. Furthermore, we establish a PTQ benchmark for SAM2, where AHCQ‑SAM yields a 14.01% improvement in J&F for 4‑bit SAM2‑Tiny on the SA‑V Test dataset. Finally, FPGA‑based implementation validates the practical utility of AHCQ‑SAM, delivering a 7.12x speedup and a 6.62x power efficiency improvement over the floating‑point baseline.
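To illustrate why logarithmic quantization suits attention scores, the sketch below shows a generic power‑of‑two (log2) quantizer for values in (0, 1], such as post‑softmax probabilities. It is a textbook‑style quantizer, not the paper's LNQ; the function names and the 4‑bit default are assumptions for illustration.

```python
import math

def log2_quantize(x, bits=4):
    """Map a score in (0, 1] to an integer code q so that x is approximated by 2**-q.

    Small scores get fine relative resolution; codes are clipped to the bit width.
    """
    levels = (1 << bits) - 1
    if x <= 0.0:
        return levels  # smallest representable magnitude
    return min(levels, max(0, round(-math.log2(x))))

def log2_dequantize(q):
    """Recover the approximate score from its integer code."""
    return 2.0 ** -q
```

Because dequantized values are powers of two, multiplying by them can be implemented as a bit shift in hardware, which is one reason such quantizers are often considered hardware‑friendly.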

Abstract:
Semi‑supervised semantic segmentation has witnessed remarkable advancements in recent years. However, existing algorithms are based on convolutional neural networks and directly applying them to Vision Transformers poses certain limitations due to conceptual disparities. To this end, we propose TokenMix, a data augmentation technique specifically designed for semi‑supervised semantic segmentation with Vision Transformers. TokenMix aligns well with the global attention mechanism by mixing images at the token level, enhancing learning capability for contextual information among image patches. We further incorporate image augmentation and feature augmentation to promote the diversity of augmentation. Moreover, to enhance consistency regularization, we propose a dual‑branch framework where each branch applies image and feature augmentation to the input image. We conduct extensive experiments across multiple benchmark datasets, including Pascal VOC 2012, Cityscapes, and COCO. Results suggest that the proposed method outperforms state‑of‑the‑art algorithms with notably observed accuracy improvement, especially under limited fine annotations.
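Mixing at the token level can be pictured as swapping whole patches between two images according to a binary mask defined on the patch grid. The sketch below is a minimal illustration under assumptions: how the mask is sampled is not stated in the abstract, and `token_mix` is a hypothetical name.

```python
def token_mix(img_a, img_b, patch_mask, patch):
    """Mix two equally sized 2D images patch by patch.

    patch_mask[i][j] truthy  -> take patch (i, j) from img_b
    patch_mask[i][j] falsy   -> keep patch (i, j) from img_a
    patch is the side length of one square patch (token) in pixels.
    """
    height, width = len(img_a), len(img_a[0])
    return [
        [
            img_b[r][c] if patch_mask[r // patch][c // patch] else img_a[r][c]
            for c in range(width)
        ]
        for r in range(height)
    ]
```

In a semi‑supervised pipeline of this kind, the same mask would typically also be applied to the labels or pseudo‑labels so that supervision stays aligned with the mixed image.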

Abstract:
A main bottleneck of learning‑based robotic scene understanding methods is the heavy reliance on extensive annotated training data, which often limits their generalization ability. In LiDAR panoptic segmentation, this challenge becomes even more pronounced due to the need to simultaneously address both semantic and instance segmentation from complex, high‑dimensional point cloud data. In this work, we address the challenge of LiDAR panoptic segmentation with very few labeled samples by leveraging recent advances in label‑efficient vision panoptic segmentation. To this end, we propose a novel method, Limited‑Label LiDAR Panoptic Segmentation (L3PS), which requires only a minimal amount of labeled data. Our approach first utilizes a label‑efficient 2D network to generate panoptic pseudo‑labels from a small set of annotated images, which are subsequently projected onto point clouds. We then introduce a novel 3D refinement module that capitalizes on the geometric properties of point clouds. By incorporating clustering techniques, sequential scan accumulation, and ground point separation, this module significantly enhances the accuracy of the pseudo‑labels, improving segmentation quality by up to +10.6 PQ and +7.9 mIoU. We demonstrate that these refined pseudo‑labels can be used to effectively train off‑the‑shelf LiDAR segmentation networks. Through extensive experiments, we show that L3PS not only outperforms existing methods but also substantially reduces the annotation burden. We release the code of our work at https://l3ps.cs.uni‑freiburg.de.
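The step of projecting 2D pseudo‑labels onto point clouds can be sketched with a pinhole camera model. This is a simplified illustration, not the paper's implementation: it assumes the points are already expressed in the camera frame (the LiDAR‑to‑camera transform is omitted), and `project_labels` and its parameters are hypothetical.

```python
def project_labels(points, label_map, fx, fy, cx, cy, ignore=-1):
    """Assign each 3D point the 2D label of the pixel it projects to.

    points:    iterable of (x, y, z) in the camera frame, z pointing forward
    label_map: 2D list of per-pixel labels, indexed as label_map[v][u]
    fx, fy, cx, cy: pinhole intrinsics
    Points behind the camera or outside the image receive `ignore`.
    """
    height, width = len(label_map), len(label_map[0])
    labels = []
    for x, y, z in points:
        if z <= 0:
            labels.append(ignore)
            continue
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))
        inside = 0 <= u < width and 0 <= v < height
        labels.append(label_map[v][u] if inside else ignore)
    return labels
```

Labels transferred this way are noisy near object boundaries and occlusions, which is the kind of error a subsequent 3D refinement stage is meant to reduce.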

Abstract:
Video object segmentation is an emerging technology that is well‑suited for real‑time surgical video segmentation, offering valuable clinical assistance in the operating room by ensuring consistent frame tracking. However, its adoption is limited by the need for manual intervention to select the tracked object, making it impractical in surgical settings. In this work, we tackle this challenge with an innovative solution: using previously annotated frames from other patients as the tracking frames. We find that this unconventional approach can match or even surpass the performance of using patients' own tracking frames, enabling more autonomous and efficient AI‑assisted surgical workflows. Furthermore, we analyze the benefits and limitations of this approach, highlighting its potential to enhance segmentation accuracy while reducing the need for manual input. Our findings provide insights into key factors influencing performance, offering a foundation for future research on optimizing cross‑patient frame selection for real‑time surgical video analysis.

Abstract:
The scalability of instructable agents in robotics or gaming is often hindered by limited data that pairs instructions with agent trajectories. However, large datasets of unannotated trajectories containing sequences of various agent behaviour (play trajectories) are often available. In a semi‑supervised setup, we explore methods to extract labelled segments from play trajectories. The goal is to augment a small annotated dataset of instruction‑trajectory pairs to improve the performance of an instruction‑following policy trained downstream via imitation learning. Assuming little variation in segment length, recent video segmentation methods can effectively extract labelled segments. To address the constraint of segment length, we propose Play Segmentation (PS), a probabilistic model that finds maximum‑likelihood segmentations of extended subsegments, while only being trained on individual instruction segments. Our results in a game environment and a simulated robotic gripper setting underscore the importance of segmentation; randomly sampled segments diminish performance, while incorporating labelled segments from PS improves policy performance to the level of a policy trained on twice the amount of labelled data.

Abstract:
Retrieval‑augmented generation (RAG) has demonstrated significant proficiency in conducting question‑answering (QA) tasks within a specified corpus. Nonetheless, numerous failure instances of RAG in QA still exist. These failures are not solely attributable to the limitations of Large Language Models (LLMs); instead, they predominantly arise from the retrieval of inaccurate information for LLMs due to two limitations: (1) Current RAG methods segment the corpus without considering semantics, making it difficult to find relevant context due to impaired correlation between questions and the segments. (2) There is a trade‑off between missing essential context when less context is retrieved and getting irrelevant context when more is retrieved. In this paper, we introduce a RAG framework (SAGE) to overcome these limitations. First, to address segmentation that ignores semantics, we propose to train a semantic segmentation model. This model is trained to segment the corpus into semantically complete chunks. Second, to ensure that only the most relevant chunks are retrieved while the irrelevant ones are ignored, we design a chunk selection algorithm to dynamically select chunks based on the decreasing speed of the relevance score, leading to a more relevant selection. Third, to further ensure the precision of the retrieved chunks, we propose letting LLMs assess whether retrieved chunks are excessive or lacking and then adjust the amount of context accordingly. Experiments show that SAGE outperforms baselines by 61.25% in the quality of QA on average. Moreover, by avoiding retrieving noisy context, SAGE lowers the cost of the tokens consumed in LLM inference and achieves a 49.41% enhancement in cost efficiency on average. Additionally, our work offers valuable insights for boosting RAG.
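
One way to read "selecting chunks based on the decreasing speed of the relevance score" is sketched below. The abstract does not give the exact rule, so the relative‑drop criterion, the function name `select_chunks`, and the `max_drop` threshold are all assumptions made for illustration.

```python
def select_chunks(scores, max_drop=0.3):
    """Return indices of chunks to keep, in descending relevance order.

    Selection stops at the first chunk whose score falls by more than
    `max_drop` (relative to the previously kept chunk). This stopping rule
    is an illustrative assumption, not the rule used in the paper.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    if not order:
        return []
    kept = [order[0]]  # always keep the single most relevant chunk
    for idx in order[1:]:
        prev = scores[kept[-1]]
        drop = (prev - scores[idx]) / prev if prev > 0 else 1.0
        if drop > max_drop:
            break  # relevance is falling too fast; stop retrieving
        kept.append(idx)
    return kept
```

With scores `[0.2, 0.9, 0.85, 0.8, 0.3]`, the three closely scored chunks (indices 1, 2, 3) are kept and the sharp fall to 0.3 ends the selection, so the number of retrieved chunks adapts to the query instead of being a fixed top‑k.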

Abstract:
Online zero‑shot 3D instance segmentation of a progressively reconstructed scene is both a critical and challenging task for embodied applications. With the success of visual foundation models (VFMs) in the image domain, leveraging 2D priors to address 3D online segmentation has become a prominent research focus. Since segmentation results provided by 2D priors often require spatial consistency to be lifted into final 3D segmentation, an efficient method for identifying spatial overlap among 2D masks is essential, yet existing methods rarely achieve this in real time, which largely limits them to offline use. To address this, we propose an efficient method that lifts 2D masks generated by VFMs into a unified 3D instance using a hashing technique. By employing voxel hashing for efficient 3D scene querying, our approach reduces the time complexity of costly spatial overlap queries from O(n^2) to O(n). Accurate spatial associations further enable 3D merging of 2D masks through simple similarity‑based filtering in a zero‑shot manner, making our approach more robust to incomplete and noisy data. Evaluated on the ScanNet and SceneNN benchmarks, our approach achieves state‑of‑the‑art performance in online, zero‑shot 3D instance segmentation with leading efficiency.
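
The complexity reduction can be illustrated with a minimal voxel‑hashing sketch. The class `VoxelIndex`, the helper `voxel_key`, and the default voxel size are hypothetical names and values chosen for illustration; the paper's data structure is not specified in the abstract.

```python
from collections import Counter, defaultdict


def voxel_key(point, voxel_size):
    """Quantize a 3D point to integer voxel coordinates (the hash key)."""
    return tuple(int(c // voxel_size) for c in point)


class VoxelIndex:
    """Hash map from voxel key to the ids of instances occupying that voxel."""

    def __init__(self, voxel_size=0.05):
        self.voxel_size = voxel_size
        self.voxels = defaultdict(set)

    def insert(self, instance_id, points):
        """Register the back-projected points of one mask under an instance id."""
        for p in points:
            self.voxels[voxel_key(p, self.voxel_size)].add(instance_id)

    def overlaps(self, points):
        """Count, per stored instance, how many voxels it shares with the query."""
        counts = Counter()
        query_voxels = {voxel_key(p, self.voxel_size) for p in points}
        for key in query_voxels:
            # One expected O(1) lookup per voxel, instead of comparing the
            # query against every point of every stored mask.
            for instance_id in self.voxels.get(key, ()):
                counts[instance_id] += 1
        return counts
```

Each query costs time proportional to the number of query points (plus the instances found in the touched voxels), which is the kind of linear behaviour the abstract contrasts with pairwise O(n^2) comparison.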

Abstract:
The proliferation of creative video content has driven demand for adapting language models to handle video input and enable multimodal understanding. However, end‑to‑end models struggle to process long videos due to their size and complexity. An effective alternative is to divide them into smaller chunks to be processed separately, and this motivates a method for choosing where the chunk boundaries should be. In this paper, we propose an algorithm for segmenting videos into contiguous chunks, based on the minimum description length principle, coupled with a dynamic programming search. The algorithm is entirely parameter‑free, given feature vectors, not requiring a set threshold or the number or size of chunks to be specified. We show empirically that the breakpoints it produces more accurately approximate scene boundaries in long videos, compared with existing methods for scene detection, even when such methods have access to the true number of scenes. We then showcase this algorithm in two tasks: long video summarization, and retrieval‑augmented video question answering. In both cases, scene breaks produced by our algorithm lead to better downstream performance than existing methods for video segmentation.

Abstract:
Room reidentification (ReID) is a challenging yet essential task with numerous applications in fields such as augmented reality (AR) and homecare robotics. Existing visual place recognition (VPR) methods, which typically rely on global descriptors or aggregate local features, often struggle in cluttered indoor environments densely populated with man‑made objects. These methods tend to overlook the crucial role of object‑oriented information. To address this, we propose AirRoom, an object‑aware pipeline that integrates multi‑level object‑oriented information (from global context to object patches, object segmentation, and keypoints) using a coarse‑to‑fine retrieval approach. Extensive experiments on four newly constructed datasets (MPReID, HMReID, GibsonReID, and ReplicaReID) demonstrate that AirRoom outperforms state‑of‑the‑art (SOTA) models across nearly all evaluation metrics, with improvements ranging from 6% to 80%. Moreover, AirRoom exhibits significant flexibility, allowing various modules within the pipeline to be substituted with different alternatives without compromising overall performance. It also shows robust and consistent performance under diverse viewpoint variations.

Abstract:
Object recognition and detection are well‑studied problems with a developed set of almost standard solutions. Identity document recognition, classification, detection, and localization are tasks required in a number of applications, particularly in physical access control security systems at critical infrastructure premises. In this paper, we propose a new model architecture based on a convolutional neural network and a semantic segmentation approach for the recognition and detection of identity documents in images. The challenge in processing such images is the limited computational performance and the limited amount of memory available when such an application runs on industrial one‑board microcomputer hardware. The aim of this research is to prove the feasibility of the proposed technique and to obtain quality metrics. The methodology of the research is to evaluate the deep learning detection model trained on the mobile identity document video dataset. The dataset contains five hundred video clips for fifty different identity document types. The numerical results from simulations are used to evaluate the quality metrics. We present the results as accuracy versus threshold of the intersection over union value. The paper reports an accuracy above 0.75 for the intersection over union (IoU) threshold value of 0.8. In addition, we assessed the size of the model and proved the feasibility of running the model on an industrial one‑board microcomputer or smartphone hardware.
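
The "accuracy versus IoU threshold" metric can be sketched as follows. Axis‑aligned boxes in `(x1, y1, x2, y2)` form and the function names are assumptions made for illustration; the paper may evaluate on quadrilaterals or segmentation masks instead.

```python
def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_thresholds(preds, truths, thresholds):
    """Fraction of predictions whose IoU with the ground truth meets each threshold."""
    ious = [box_iou(p, t) for p, t in zip(preds, truths)]
    return {t: sum(iou >= t for iou in ious) / len(ious) for t in thresholds}
```

Under this reading, the reported result corresponds to `accuracy_at_thresholds(...)[0.8]` exceeding 0.75, that is, more than three quarters of the detections overlapping the ground truth with an IoU of at least 0.8.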

Abstract:
Recent years have witnessed a growing academic and industrial interest in deep learning (DL) for medical imaging. To perform well, DL models require very large labeled datasets. However, most medical imaging datasets are small, with a limited number of annotated samples. They are usually small because delineating medical images is time‑consuming and demanding for oncologists. Various techniques can be used to augment a dataset, for example, applying affine or elastic transformations to available images, or adding synthetic images generated by a Generative Adversarial Network (GAN). In this work, we have developed a novel conditional variant of a current GAN method, StyleGAN2, to generate multi‑modal high‑resolution medical images in order to augment small medical imaging datasets with these synthetic images. We use the synthetic and real images from six datasets to train models for the downstream task of semantic segmentation. The quality of the generated medical images and the effect of this augmentation on the segmentation performance were evaluated afterward. The results indicate that the downstream segmentation models did not benefit from the generated images. Further work and analyses are required to establish how this augmentation affects the segmentation performance.

Abstract:
Wake vortices ‑ strong, coherent air turbulences created by aircraft ‑ pose a significant risk to aviation safety and therefore require accurate and reliable detection methods. In this paper, we present an advanced, explainable machine learning method that utilizes Light Detection and Ranging (LiDAR) data for effective wake vortex detection. Our method leverages a dynamic graph CNN (DGCNN) with semantic segmentation to partition a 3D LiDAR point cloud into meaningful segments. Further refinement is achieved through clustering techniques. A novel feature of our research is the use of a perturbation‑based explanation technique, which clarifies the model's decision‑making processes for air traffic regulators and controllers, increasing transparency and building trust. Our experimental results, based on measured and simulated LiDAR scans compared against four baseline methods, underscore the effectiveness and reliability of our approach. This combination of semantic segmentation and clustering for real‑time wake vortex tracking significantly advances aviation safety measures, ensuring that these are both effective and comprehensible.

Abstract:
Model reuse offers a solution to the challenges of segmentation in biomedical imaging, where high data annotation costs remain a major bottleneck for deep learning. However, although many pretrained models are released through challenges, model zoos, and repositories, selecting the most suitable model for a new dataset remains difficult due to the lack of reliable model ranking methods. We introduce the first black‑box‑compatible framework for unsupervised and source‑free ranking of semantic and instance segmentation models based on the consistency of predictions under perturbations. While ranking methods have been studied for classification and a few segmentation‑related approaches exist, most target related tasks such as transferability estimation or model validation and typically rely on labelled data, feature‑space access, or specific training assumptions. In contrast, our method directly addresses the repository setting and applies to both semantic and instance segmentation, for zero‑shot reuse or after unsupervised domain adaptation. We evaluate the approach across a wide range of biomedical segmentation tasks in both 2D and 3D imaging, showing that our estimated rankings strongly correlate with true target‑domain model performance rankings.

Abstract:
It remains a significant challenge to compress images at extremely low bitrates while achieving both semantic consistency and high perceptual quality. Inspired by the human progressive perception mechanism, we propose a Semantically Disentangled Image Compression framework (SEDIC) in this paper. Initially, an extremely compressed reference image is obtained through a learned image encoder. Then we leverage LMMs to extract essential semantic components, including overall descriptions, detailed object descriptions, and semantic segmentation masks. We propose a training‑free Object Restoration model with Attention Guidance (ORAG) built on pre‑trained ControlNet to restore object details conditioned on object‑level text descriptions and semantic masks. Based on the proposed ORAG, we design a multistage semantic image decoder to progressively restore the details object by object, starting from the extremely compressed reference image, ultimately generating high‑quality and high‑fidelity reconstructions. Experimental results demonstrate that SEDIC significantly outperforms state‑of‑the‑art approaches, achieving superior perceptual quality and semantic consistency at extremely low bitrates (≤ 0.05 bpp).

Abstract:
Video object segmentation (VOS) is a critical task in the development of video perception and understanding. The Segment‑Anything Model 2 (SAM 2), released by Meta AI, is the current state‑of‑the‑art architecture for end‑to‑end VOS. SAM 2 performs very well on both clean video data and augmented data, and completely intelligent video perception requires an understanding of how this architecture is capable of achieving such quality results. To better understand how each step within the SAM 2 architecture permits high‑quality video segmentation, a variety of complex video transformations are passed through the architecture, and the impact at each stage of the process is measured. It is observed that each progressive stage enables the filtering of complex transformation noise and the emphasis of the object of interest. Contributions include the creation of complex transformation video datasets, an analysis of how each stage of the SAM 2 architecture interprets these transformations, and visualizations of segmented objects through each stage. By better understanding how each model structure impacts overall video understanding, VOS development can work to improve real‑world applicability and performance in tracking, localizing, and segmenting objects despite complex cluttered scenes and obscurations.

Abstract:
Real‑time video segmentation is a promising opportunity for AI‑assisted surgery, offering intraoperative guidance by identifying tools and anatomical structures. Despite growing interest in surgical video segmentation, annotation protocols vary widely across datasets: some provide dense, frame‑by‑frame labels, while others rely on sparse annotations sampled at low frame rates such as 1 FPS. In this study, we investigate how such inconsistencies in annotation density and frame rate sampling influence the evaluation of zero‑shot segmentation models, using SAM2 as a case study for cholecystectomy procedures. Surprisingly, we find that under conventional sparse evaluation settings, lower frame rates can appear to outperform higher ones due to a smoothing effect that conceals temporal inconsistencies. However, when assessed under real‑time streaming conditions, higher frame rates yield superior segmentation stability, particularly for dynamic objects like surgical graspers. To understand how these differences align with human perception, we conducted a survey among surgeons, nurses, and machine learning engineers and found that participants consistently preferred high‑FPS segmentation overlays, reinforcing the importance of evaluating every frame in real‑time applications rather than relying on sparse sampling strategies. Our findings highlight the risk of evaluation bias that is introduced by inconsistent dataset protocols and bring attention to the need for temporally fair benchmarking in surgical video AI.

Abstract:
We introduce COU: Common Objects Underwater, an instance‑segmented image dataset of commonly found man‑made objects in multiple aquatic and marine environments. COU contains approximately 10K segmented images, annotated from images collected during a number of underwater robot field trials in diverse locations. COU has been created to address the lack of datasets with robust class coverage curated for underwater instance segmentation, which is particularly useful for training light‑weight, real‑time capable detectors for Autonomous Underwater Vehicles (AUVs). In addition, COU addresses the lack of diversity in object classes since the commonly available underwater image datasets focus only on marine life. Currently, COU contains images from both closed‑water (pool) and open‑water (lakes and oceans) environments, of 24 different classes of objects including marine debris, dive tools, and AUVs. To assess the efficacy of COU in training underwater object detectors, we use three state‑of‑the‑art models to evaluate its performance and accuracy, using a combination of standard accuracy and efficiency metrics. The improved performance of COU‑trained detectors over those solely trained on terrestrial data demonstrates the clear advantage of training with annotated underwater images. We make COU available for broad use under open‑source licenses.

Abstract:
3D vision‑language grounding faces a fundamental data bottleneck: while 2D models train on billions of images, 3D models have access to only thousands of labeled scenes, a six‑order‑of‑magnitude gap that severely limits performance. We introduce LIFT‑GS, a practical distillation technique that overcomes this limitation by using differentiable rendering to bridge 3D and 2D supervision. LIFT‑GS predicts 3D Gaussian representations from point clouds and uses them to render predicted language‑conditioned 3D masks into 2D views, enabling supervision from 2D foundation models (SAM, CLIP, LLaMA) without requiring any 3D annotations. This render‑supervised formulation enables end‑to‑end training of complete encoder‑decoder architectures and is inherently model‑agnostic. LIFT‑GS achieves state‑of‑the‑art results with 25.7% mAP on open‑vocabulary instance segmentation (vs. 20.2% prior SOTA) and consistent 10‑30% improvements on referential grounding tasks. Remarkably, pretraining effectively multiplies fine‑tuning datasets by 2X, demonstrating strong scaling properties that suggest 3D VLG currently operates in a severely data‑scarce regime. Project page: https://liftgs.github.io

Abstract:
Masked autoencoders (MAE) have shown tremendous potential for self‑supervised learning (SSL) in vision and beyond. However, point clouds from LiDARs used in automated driving are particularly challenging for MAEs since large areas of the 3D volume are empty. Consequently, existing work suffers from leaking occupancy information into the decoder and has significant computational complexity, thereby limiting the SSL pre‑training to only 2D bird's eye view encoders in practice. In this work, we propose the novel neighborhood occupancy MAE (NOMAE) that overcomes the aforementioned challenges by employing masked occupancy reconstruction only in the neighborhood of non‑masked voxels. We incorporate voxel masking and occupancy reconstruction at multiple scales with our proposed hierarchical mask generation technique to capture features of objects of different sizes in the point cloud. NOMAEs are extremely flexible and can be directly employed for SSL in existing 3D architectures. We perform extensive evaluations on the nuScenes and Waymo Open datasets for the downstream perception tasks of semantic segmentation and 3D object detection, comparing with both discriminative and generative SSL methods. The results demonstrate that NOMAE sets the new state‑of‑the‑art on multiple benchmarks for multiple point cloud perception tasks.

Abstract:
Robust and accurate localization is critical for autonomous driving. Traditional GNSS‑based localization methods suffer from signal occlusion and multipath effects in urban environments. Meanwhile, methods relying on high‑definition (HD) maps are constrained by the high costs associated with the construction and maintenance of HD maps. Methods based on standard‑definition (SD) maps, on the other hand, often exhibit unsatisfactory performance or poor generalization ability due to overfitting. To address these challenges, we propose SegLocNet, a multimodal GNSS‑free localization network that achieves precise localization using bird's‑eye‑view (BEV) semantic segmentation. SegLocNet employs a BEV segmentation network to generate semantic maps from multiple sensor inputs, followed by an exhaustive matching process to estimate the vehicle's ego pose. This approach avoids the limitations of regression‑based pose estimation and maintains high interpretability and generalization. By introducing a unified map representation, our method can be applied to both HD and SD maps without any modifications to the network architecture, thereby balancing localization accuracy and area coverage. Extensive experiments on the nuScenes and Argoverse datasets demonstrate that our method outperforms the current state‑of‑the‑art methods and can accurately estimate the ego pose in urban environments without relying on GNSS, while maintaining strong generalization ability. Our code and pre‑trained model will be released publicly.

Abstract:
3D affordance detection is a challenging problem with broad applications in various robotic tasks. Existing methods typically formulate the detection paradigm as a label‑based semantic segmentation task. This paradigm relies on predefined labels and lacks the ability to comprehend complex natural language, resulting in limited generalization in open‑world scenes. To address these limitations, we reformulate the traditional affordance detection paradigm into the Instruction Reasoning Affordance Segmentation (IRAS) task. This task is designed to output an affordance mask region given a query reasoning text, which avoids fixed categories of input labels. We accordingly propose 3D‑AffordanceLLM (3D‑ADLLM), a framework designed for reasoning affordance detection in 3D open scenes. Specifically, 3D‑ADLLM introduces large language models (LLMs) to 3D affordance perception with a custom‑designed decoder for generating affordance masks, thus achieving open‑world reasoning affordance detection. In addition, given the scarcity of 3D affordance datasets for training large models, we seek to extract knowledge from general segmentation data and transfer it to affordance detection. Thus, we propose a multi‑stage training strategy that begins with a novel pre‑training task, i.e., Referring Object Part Segmentation (ROPS). This stage is designed to equip the model with general recognition and segmentation capabilities at the object‑part level. Then, through fine‑tuning on the IRAS task, 3D‑ADLLM obtains the reasoning ability for affordance detection. In summary, 3D‑ADLLM leverages the rich world knowledge and human‑object interaction reasoning ability of LLMs, achieving approximately an 8% improvement in mIoU on open‑vocabulary affordance detection tasks.

Abstract:
Masked autoencoders (MAEs) represent a prominent self‑supervised learning paradigm in computer vision. Despite their empirical success, the underlying mechanisms of MAEs remain insufficiently understood. Recent studies have attempted to elucidate the functioning of MAEs through contrastive learning and feature representation analysis, yet these approaches often provide only implicit insights. In this paper, we propose a new perspective for understanding MAEs by leveraging the information bottleneck principle in information theory. Our theoretical analyses reveal that optimizing the latent features to balance relevant and irrelevant information is key to improving MAE performance. Building upon our proofs, we introduce MI‑MAE, a novel method that optimizes MAEs through mutual information maximization and minimization. By enhancing latent features to retain maximal relevant information between them and the output, and minimizing irrelevant information between them and the input, our approach achieves better performance. Extensive experiments on standard benchmarks show that MI‑MAE significantly outperforms MAE models in tasks such as image classification, object detection, and semantic segmentation. Our findings validate the theoretical framework and highlight the practical advantages of applying the information bottleneck principle to MAEs, offering deeper insights for developing more powerful self‑supervised learning models.

Abstract:
Outdoor LiDAR point cloud 3D instance segmentation is a crucial task in autonomous driving. However, it requires laborious human effort to annotate the point cloud for training a segmentation model. To address this challenge, we propose the YoCo framework, which generates 3D pseudo labels using minimal coarse click annotations in the bird's eye view plane. It is a significant challenge to produce high‑quality pseudo labels from sparse annotations. Our YoCo framework first leverages vision foundation models combined with geometric constraints from point clouds to enhance pseudo label generation. Second, a temporal and spatial‑based label updating module is designed to generate reliable updated labels. It leverages predictions from adjacent frames and utilizes the inherent density variation of point clouds (dense near, sparse far). Finally, to further improve label quality, an IoU‑guided enhancement module is proposed, replacing pseudo labels with high‑confidence and high‑IoU predictions. Experiments on the Waymo dataset demonstrate YoCo's effectiveness and generality, achieving state‑of‑the‑art performance among weakly supervised methods and surpassing fully supervised Cylinder3D. Additionally, YoCo is suitable for various networks, achieving performance comparable to fully supervised methods with minimal fine‑tuning using only 0.8% of the fully labeled data, significantly reducing annotation costs.

Abstract:
In this work, we present CoCal, an interpretable and consistent object parsing framework built on a dictionary‑based mask transformer. Designed around Contrastive Components and Logical Constraints, CoCal rethinks existing cluster‑based mask transformer architectures used in segmentation. Specifically, CoCal utilizes a set of dictionary components, with each component being explicitly linked to a specific semantic class. To advance this concept, CoCal introduces a hierarchical formulation of dictionary components that aligns with the semantic hierarchy. This is achieved through the integration of both within‑level contrastive components and cross‑level logical constraints. Concretely, CoCal employs a component‑wise contrastive algorithm at each semantic level, enabling the contrasting of dictionary components within the same class against those from different classes. Furthermore, CoCal addresses logical concerns by ensuring that the dictionary component representing a particular part is closer to its corresponding object component than to those of other objects through a cross‑level contrastive learning objective. To further enhance our logical relation modeling, we implement a post‑processing function inspired by the principle that a pixel assigned to a part should also be assigned to its corresponding object. With these innovations, CoCal establishes a new state‑of‑the‑art performance on both PartImageNet and Pascal‑Part‑108, outperforming previous methods by a significant margin of 2.08% and 0.70% in part mIoU, respectively. Moreover, CoCal exhibits notable enhancements in object‑level metrics across these benchmarks, highlighting its capacity to not only refine parsing at a finer level but also elevate the overall quality of object segmentation.
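
The part‑object principle behind the post‑processing step can be sketched as below. The abstract does not say how conflicts are resolved, so restricting each pixel's part prediction to the parts of its predicted object is an assumption, as are the function name and the per‑pixel dictionary representation.

```python
def enforce_part_object_consistency(object_labels, part_scores, part_to_object):
    """Pick, per pixel, the best-scoring part that belongs to the predicted object.

    object_labels:  predicted object class for each pixel.
    part_scores:    one dict per pixel mapping part name to score.
    part_to_object: mapping from part name to its parent object class.
    Returns None for a pixel whose object has no candidate part.
    (Resolving conflicts in favour of the object label is an assumption.)
    """
    result = []
    for obj, scores in zip(object_labels, part_scores):
        # Discard parts whose parent object differs from the predicted object.
        allowed = {p: s for p, s in scores.items() if part_to_object[p] == obj}
        result.append(max(allowed, key=allowed.get) if allowed else None)
    return result
```

A pixel predicted as "dog" can then never receive the part label "car_wheel", even if that part has the highest raw score, which is one concrete way to guarantee the logical constraint at inference time.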

Abstract:
An increasing number of datasets sharing similar domains for semantic segmentation have been published over the past few years. But despite the growing amount of overall data, it is still difficult to train bigger and better models due to inconsistency in taxonomy and/or labeling policies of different datasets. To this end, we propose a knowledge distillation approach that also serves as a label space unification method for semantic segmentation. In short, a teacher model is trained on a source dataset with a given taxonomy, then used to pseudo‑label additional data for which ground truth labels of a related label space exist. By mapping the related taxonomies to the source taxonomy, we create constraints within which the model can predict pseudo‑labels. Using the improved pseudo‑labels we train student models that consistently outperform their teachers in two challenging domains, namely urban and off‑road driving. Our ground truth‑corrected pseudo‑labels span 12 and 7 public datasets with 388,230 and 18,558 images for the urban and off‑road domains, respectively, creating the largest compound datasets for autonomous driving to date.
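
The idea of constraining pseudo‑labels through a taxonomy mapping can be sketched per pixel as follows. The function name, the dictionary‑based scores, and the example class names in the note are hypothetical; the paper's actual mapping and handling of unmapped labels are not described in the abstract.

```python
def constrained_pseudo_label(teacher_scores, related_label, allowed_sources):
    """Choose the teacher's best source class among those the ground truth permits.

    teacher_scores:  dict mapping source-taxonomy class to the teacher's score.
    related_label:   ground-truth label of the pixel in the related taxonomy.
    allowed_sources: dict mapping each related label to the set of source
                     classes it may correspond to (the taxonomy mapping).
    Returns None if the mapping leaves no candidate (an illustrative choice).
    """
    allowed = allowed_sources.get(related_label, set())
    candidates = {c: s for c, s in teacher_scores.items() if c in allowed}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)
```

For example, if the related dataset labels a pixel as "vehicle" and the mapping allows only "car" or "truck", the pseudo‑label is the higher scoring of those two, even when the unconstrained teacher would have predicted "road".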

Abstract:
Achieving optimal semantic segmentation with frame‑based vision sensors poses significant challenges for real‑time systems like UAVs and self‑driving cars, which require rapid and precise processing. Traditional frame‑based methods often struggle to balance latency, accuracy, and energy efficiency. To address these challenges, we leverage event streams from event‑based cameras, bio‑inspired sensors that trigger events in response to changes in the scene. Specifically, we analyze the number of events triggered between successive frames, with a high number indicating significant changes and a low number indicating minimal changes. We exploit this event information to solve the semantic segmentation task by employing a Spiking Neural Network (SNN), a bio‑inspired computing paradigm known for its low energy consumption. Our experiments on the DSEC dataset show that our approach significantly reduces latency with only a limited drop in accuracy. Additionally, by using SNNs, we achieve low power consumption, making our method suitable for energy‑constrained real‑time applications. To the best of our knowledge, our approach is the first to effectively balance reduced latency, minimal accuracy loss, and energy efficiency using event streams to enhance semantic segmentation in dynamic and resource‑limited environments.

Abstract:
Multi‑teacher Knowledge Distillation (KD) transfers diverse knowledge from a teacher pool to a student network. The core problem of multi‑teacher KD is how to balance distillation strengths among various teachers. Most existing methods develop weighting strategies from an individual perspective of teacher performance or teacher‑student gaps, lacking comprehensive information for guidance. This paper proposes Multi‑Teacher Knowledge Distillation with Reinforcement Learning (MTKD‑RL) to optimize multi‑teacher weights. In this framework, we construct both teacher performance and teacher‑student gaps as state information to an agent. The agent outputs the teacher weight and can be updated by the return reward from the student. MTKD‑RL reinforces the interaction between the student and teacher using an agent in an RL‑based decision mechanism, achieving better matching capability with more meaningful weights. Experimental results on visual recognition tasks, including image classification, object detection, and semantic segmentation, demonstrate that MTKD‑RL achieves state‑of‑the‑art performance compared to existing multi‑teacher KD methods.

Abstract:
Vision‑Language Navigation (VLN) aims to guide agents by leveraging language instructions and visual cues, playing a pivotal role in embodied AI. Indoor VLN has been extensively studied, whereas outdoor aerial VLN remains underexplored. The potential reason is that outdoor aerial view encompasses vast areas, making data collection more challenging, which results in a lack of benchmarks. To address this problem, we propose OpenFly, a platform comprising various rendering engines, a versatile toolchain, and a large‑scale benchmark for aerial VLN. Firstly, we integrate diverse rendering engines and advanced techniques for environment simulation, including Unreal Engine, GTA V, Google Earth, and 3D Gaussian Splatting (3D GS). Particularly, 3D GS supports real‑to‑sim rendering, further enhancing the realism of our environments. Secondly, we develop a highly automated toolchain for aerial VLN data collection, streamlining point cloud acquisition, scene semantic segmentation, flight trajectory creation, and instruction generation. Thirdly, based on the toolchain, we construct a large‑scale aerial VLN dataset with 100k trajectories, covering diverse heights and lengths across 18 scenes. Moreover, we propose OpenFly‑Agent, a keyframe‑aware VLN model emphasizing key observations during flight. For benchmarking, extensive experiments and analyses are conducted, evaluating several recent VLN methods and showcasing the superiority of our OpenFly platform and agent. The toolchain, dataset, and codes will be open‑sourced.

Abstract:
Multi‑modal learning has emerged as a key technique for improving performance across domains such as autonomous driving, robotics, and reasoning. However, in certain scenarios, particularly in resource‑constrained environments, some modalities available during training may be absent during inference. While existing frameworks effectively utilize multiple data sources during training and enable inference with reduced modalities, they are primarily designed for single‑agent settings. This poses a critical limitation in dynamic environments such as connected autonomous vehicles (CAV), where incomplete data coverage can lead to decision‑making blind spots. Conversely, some works explore multi‑agent collaboration but without addressing missing modality at test time. To overcome these limitations, we propose Collaborative Auxiliary Modality Learning (CAML), a novel multi‑modal multi‑agent framework that enables agents to collaborate and share multi‑modal data during training, while allowing inference with reduced modalities during testing. Experimental results in collaborative decision‑making for CAV in accident‑prone scenarios demonstrate that CAML achieves up to a 58.1% improvement in accident detection. Additionally, we validate CAML on real‑world aerial‑ground robot data for collaborative semantic segmentation, achieving up to a 10.6% improvement in mIoU.

Abstract:
In the hyperspectral remote sensing field, some downstream dense prediction tasks, such as semantic segmentation (SS) and change detection (CD), rely on supervised learning to improve model performance and require a large amount of manually annotated data for training. However, due to the need for specific equipment and special application scenarios, the acquisition and annotation of hyperspectral images (HSIs) are often costly and time‑consuming. To this end, our work explores the potential of generative diffusion models in synthesizing HSIs with pixel‑level annotations. The main idea is to utilize a two‑stream VAE to learn the latent representations of images and corresponding masks respectively, learn their joint distribution during the diffusion model training, and finally obtain the image and mask through their respective decoders. To the best of our knowledge, this is the first work to generate high‑dimensional HSIs with annotations. Our proposed approach can be applied to various kinds of dataset generation. We select two of the most widely used dense prediction tasks, semantic segmentation and change detection, and generate datasets suitable for these tasks. Experiments demonstrate that our synthetic datasets have a positive impact on these downstream tasks.

Abstract:
Large Language Models (LLMs) are advanced deep‑learning models designed to understand and generate human language. They work together with models that process data like images, enabling cross‑modal understanding. However, existing approaches often suffer from the echo chamber effect, where redundant visual patterns reduce model generalization and accuracy. Thus, the proposed system considered this limitation and developed an enhanced LLM‑based framework for cross‑modal query understanding using DL‑KeyBERT‑based CAZSSCL‑MPGPT. The collected dataset consists of pre‑processed images and texts. The preprocessed images then undergo object segmentation using Easom‑You Only Look Once (E‑YOLO). The object skeleton is generated, along with the knowledge graph using a Conditional Random Knowledge Graph (CRKG) technique. Further, features are extracted from the knowledge graph, generated skeletons, and segmented objects. The optimal features are then selected using the Fossa Optimization Algorithm (FOA). Meanwhile, the text undergoes word embedding using DL‑KeyBERT. Finally, the cross‑modal query understanding system utilizes CAZSSCL‑MPGPT to generate accurate and contextually relevant image descriptions as text. The proposed CAZSSCL‑MPGPT achieved an accuracy of 99.14187362% in the COCO dataset 2017 and 98.43224393% in the vqav2‑val dataset.

Abstract:
We present VPNeXt, a new and simple model for the Plain Vision Transformer (ViT). Unlike the many related studies that share the same homogeneous paradigms, VPNeXt offers a fresh perspective on dense representation based on ViT. In more detail, the proposed VPNeXt addressed two concerns about the existing paradigm: (1) Is it necessary to use a complex Transformer Mask Decoder architecture to obtain good representations? (2) Does the Plain ViT really need to depend on the mock pyramid feature for upsampling? For (1), we investigated the potential underlying reasons that contributed to the effectiveness of the Transformer Decoder and introduced the Visual Context Replay (VCR) to achieve similar effects efficiently. For (2), we introduced the ViTUp module. This module fully utilizes the previously overlooked ViT real pyramid feature to achieve better upsampling results compared to the earlier mock pyramid feature. This represents the first instance of such functionality in the field of semantic segmentation for Plain ViT. We performed ablation studies on related modules to verify their effectiveness gradually. We conducted relevant comparative experiments and visualizations to show that VPNeXt achieved state‑of‑the‑art performance with a simple and effective design. Moreover, the proposed VPNeXt significantly exceeded the long‑established mIoU wall/barrier of the VOC2012 dataset, setting a new state‑of‑the‑art by a large margin, which also stands as the largest improvement since 2015.

Abstract:
We introduce Dr. Splat, a novel approach for open‑vocabulary 3D scene understanding leveraging 3D Gaussian Splatting. Unlike existing language‑embedded 3DGS methods, which rely on a rendering process, our method directly associates language‑aligned CLIP embeddings with 3D Gaussians for holistic 3D scene understanding. The key of our method is a language feature registration technique where CLIP embeddings are assigned to the dominant Gaussians intersected by each pixel‑ray. Moreover, we integrate Product Quantization (PQ) trained on general large‑scale image data to compactly represent embeddings without per‑scene optimization. Experiments demonstrate that our approach significantly outperforms existing approaches in 3D perception benchmarks, such as open‑vocabulary 3D semantic segmentation, 3D object localization, and 3D object selection tasks. For video results, please visit: https://drsplat.github.io/

Abstract:
In recent years, vision‑language models (VLMs) have advanced open‑vocabulary mapping, enabling mobile robots to simultaneously achieve environmental reconstruction and high‑level semantic understanding. While integrated object cognition helps mitigate semantic ambiguity in point‑wise feature maps, efficiently obtaining rich semantic understanding and robust incremental reconstruction at the instance‑level remains challenging. To address these challenges, we introduce OpenVox, a real‑time incremental open‑vocabulary probabilistic instance voxel representation. In the front‑end, we design an efficient instance segmentation and comprehension pipeline that enhances language reasoning through encoding captions. In the back‑end, we implement probabilistic instance voxels and formulate the cross‑frame incremental fusion process into two subtasks: instance association and live map evolution, ensuring robustness to sensor and segmentation noise. Extensive evaluations across multiple datasets demonstrate that OpenVox achieves state‑of‑the‑art performance in zero‑shot instance segmentation, semantic segmentation, and open‑vocabulary retrieval. Furthermore, real‑world robotics experiments validate OpenVox's capability for stable, real‑time operation.

Abstract:
Introduction: Computer vision (CV) has had a transformative impact in biomedical fields such as radiology, dermatology, and pathology. Its real‑world adoption in surgical applications, however, remains limited. We review the current state‑of‑the‑art performance of deep learning (DL)‑based CV models for segmentation and object detection of anatomical structures in videos obtained during surgical procedures. Methods: We conducted a scoping review of studies on semantic segmentation and object detection of anatomical structures published between 2014 and 2024 from 3 major databases: PubMed, Embase, and IEEE Xplore. The primary objective was to evaluate the state‑of‑the‑art performance of semantic segmentation in surgical videos. Secondary objectives included examining DL models, progress toward clinical applications, and the specific challenges with segmentation of organs/tissues in surgical videos. Results: We identified 58 relevant published studies. These focused predominantly on procedures from general surgery [20 (34.4%)], colorectal surgery [9 (15.5%)], and neurosurgery [8 (13.8%)]. Cholecystectomy [14 (24.1%)] and low anterior rectal resection [5 (8.6%)] were the most common procedures addressed. Semantic segmentation [47 (81%)] was the primary CV task. U‑Net [14 (24.1%)] and DeepLab [13 (22.4%)] were the most widely used models. Larger organs such as the liver (Dice score: 0.88) had higher accuracy compared to smaller structures such as nerves (Dice score: 0.49). Models demonstrated real‑time inference potential ranging from 5 to 298 frames per second (fps). Conclusion: This review highlights the significant progress made in DL‑based semantic segmentation for surgical videos with real‑time applicability, particularly for larger organs. Addressing challenges with smaller structures, data availability, and generalizability remains crucial for future advancements.

Abstract:
Achieving a consistent and compact 3D segmentation field is crucial for maintaining semantic coherence across views and accurately representing scene structures. Previous 3D scene segmentation methods rely on video segmentation models to address inconsistencies across views, but the absence of spatial information often leads to object misassociation when objects temporarily disappear and reappear. Furthermore, in the process of 3D scene reconstruction, segmentation and optimization are often treated as separate tasks. As a result, optimization typically lacks awareness of semantic category information, which can result in floaters with ambiguous segmentation. To address these challenges, we introduce CCGS, a method designed to achieve both view‑consistent 2D segmentation and a compact 3D Gaussian segmentation field. CCGS incorporates pointmap association and a piecewise‑plane constraint. First, we establish pixel correspondence between adjacent images by minimizing the Euclidean distance between their pointmaps. We then redefine object mask overlap accordingly. The Hungarian algorithm is employed to optimize mask association by minimizing the total matching cost, while allowing for partial matches. To further enhance compactness, the piecewise‑plane constraint restricts point displacement within local planes during optimization, thereby preserving structural integrity. Experimental results on ScanNet and Replica datasets demonstrate that CCGS outperforms existing methods in both 2D panoptic segmentation and 3D Gaussian segmentation.
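
The mask association step can be pictured as a minimum-cost assignment problem. The sketch below is only an illustration under several assumptions: masks are sets of pixel indices already brought into a common frame, the cost is 1 minus IoU (a stand-in for the pointmap-based overlap the abstract describes), and the optimum is found by brute-force enumeration rather than the Hungarian algorithm, which solves the same problem in polynomial time. The `max_cost` cutoff is a hypothetical way to leave poorly overlapping masks unmatched.

```python
from itertools import permutations

def iou(a, b):
    """Intersection over union of two sets of pixel indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def associate_masks(prev_masks, curr_masks, max_cost=0.7):
    """Return (prev_index, curr_index) pairs minimizing the total cost.

    Brute force over all one-to-one assignments; only suitable for a
    handful of masks. Pairs whose cost exceeds `max_cost` are dropped,
    so some masks may stay unmatched.
    """
    cost = [[1.0 - iou(p, c) for c in curr_masks] for p in prev_masks]
    n, m = len(prev_masks), len(curr_masks)
    if n <= m:
        # Assign every previous mask to a distinct current mask.
        candidates = (list(enumerate(perm))
                      for perm in permutations(range(m), n))
    else:
        # Assign every current mask to a distinct previous mask.
        candidates = ([(i, j) for j, i in enumerate(perm)]
                      for perm in permutations(range(n), m))
    best_pairs, best_total = [], float("inf")
    for pairs in candidates:
        total = sum(cost[i][j] for i, j in pairs)
        if total < best_total:
            best_pairs, best_total = pairs, total
    return [(i, j) for i, j in best_pairs if cost[i][j] <= max_cost]

prev_masks = [{1, 2, 3, 4}, {10, 11, 12}]
curr_masks = [{10, 11, 12, 13}, {2, 3, 4, 5}, {20, 21}]
matches = associate_masks(prev_masks, curr_masks)
print(matches)
```

In the toy data the third current mask overlaps nothing and stays unmatched, which mirrors the case of an object newly entering the view.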

Abstract:
The increasing impact of human‑induced climate change and unplanned urban constructions has increased flooding incidents in recent years. Accurate identification of flooded areas is crucial for effective disaster management and urban planning. While few works have utilized convolutional neural networks and transformer‑based semantic segmentation techniques for identifying flooded areas from aerial footage, recent developments in graph neural networks have created improvement opportunities. This paper proposes an innovative approach, the Graph Attention Convolutional U‑NET (GAC‑UNET) model, based on graph neural networks for automated identification of flooded areas. The model incorporates a graph attention mechanism and Chebyshev layers into the U‑Net architecture. Furthermore, this paper explores the applicability of transfer learning and model reprogramming to enhance the accuracy of flood area segmentation models. Empirical results demonstrate that the proposed GAC‑UNET model outperforms other approaches with 91% mAP, 94% Dice score, and 89% IoU, providing valuable insights for informed decision‑making and better planning of future infrastructures in flood‑prone areas.
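
For readers unfamiliar with the two overlap metrics quoted above, the following sketch computes IoU and Dice for a pair of flattened binary masks and shows the identity Dice = 2·IoU / (1 + IoU), which holds for a single mask pair. The toy masks and function names are illustrative assumptions. Dataset-level averages need not satisfy the identity exactly, although the reported 89% IoU and 94% Dice are consistent with it.

```python
def iou_and_dice(pred, target):
    """IoU and Dice for two equally long sequences of 0/1 labels."""
    tp = sum(1 for p, t in zip(pred, target) if p and t)
    fp = sum(1 for p, t in zip(pred, target) if p and not t)
    fn = sum(1 for p, t in zip(pred, target) if not p and t)
    denom = tp + fp + fn
    if denom == 0:
        # Both masks are empty: treat as perfect agreement (a convention).
        return 1.0, 1.0
    return tp / denom, 2 * tp / (2 * tp + fp + fn)

def dice_from_iou(iou):
    """Convert a single-mask IoU into the corresponding Dice score."""
    return 2 * iou / (1 + iou)

pred = [1, 1, 1, 0, 0, 0]
target = [1, 1, 0, 1, 0, 0]
iou, dice = iou_and_dice(pred, target)
print(iou, dice, dice_from_iou(iou))
```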

Abstract:
Existing communication systems aim to reconstruct the information at the receiver side, and are known as reconstruction‑oriented communications. This approach often falls short in meeting the real‑time, task‑specific demands of modern AI‑driven applications such as autonomous driving and semantic segmentation. As a new design principle, task‑oriented communications have been developed. However, it typically requires joint optimization of encoder, decoder, and modified inference neural networks, resulting in extensive cross‑system redesigns and compatibility issues. This paper proposes a novel communication framework that aligns reconstruction‑oriented and task‑oriented communications for edge intelligence. The idea is to extend the Information Bottleneck (IB) theory to optimize data transmission by minimizing task‑relevant loss function, while maintaining the structure of the original data by an information reshaper. Such an approach integrates task‑oriented communications with reconstruction‑oriented communications, where a variational approach is designed to handle the intractability of mutual information in high‑dimensional neural network features. We also introduce a joint source‑channel coding (JSCC) modulation scheme compatible with classical modulation techniques, enabling the deployment of AI technologies within existing digital infrastructures. The proposed framework is particularly effective in edge‑based autonomous driving scenarios. Our evaluation in the Car Learning to Act (CARLA) simulator demonstrates that the proposed framework significantly reduces bits per service by 99.19% compared to existing methods, such as JPEG, JPEG2000, and BPG, without compromising the effectiveness of task execution.

Abstract:
Object extraction and segmentation from remote sensing (RS) images is a critical yet challenging task in urban environment monitoring. Urban morphology is inherently complex, with irregular objects of diverse shapes and varying scales. These challenges are amplified by heterogeneity and scale disparities across RS data sources, including sensors, platforms, and modalities, making accurate object segmentation particularly demanding. While the Segment Anything Model (SAM) has shown significant potential in segmenting complex scenes, its performance in handling form‑varying objects remains limited due to manual‑interactive prompting. To this end, we propose UrbanSAM, a customized version of SAM specifically designed to analyze complex urban environments while tackling scaling effects from remotely sensed observations. Inspired by multi‑resolution analysis (MRA) theory, UrbanSAM incorporates a novel learnable prompter equipped with a Uscaling‑Adapter that adheres to the invariance criterion, enabling the model to capture multiscale contextual information of objects and adapt to arbitrary scale variations with theoretical guarantees. Furthermore, features from the Uscaling‑Adapter and the trunk encoder are aligned through a masked cross‑attention operation, allowing the trunk encoder to inherit the adapter's multiscale aggregation capability. This synergy enhances the segmentation performance, resulting in more powerful and accurate outputs, supported by the learned adapter. Extensive experimental results demonstrate the flexibility and superior segmentation performance of the proposed UrbanSAM on a global‑scale dataset, encompassing scale‑varying urban objects such as buildings, roads, and water.

Abstract:
Bird's Eye View (BEV) semantic maps have recently garnered a lot of attention as a useful representation of the environment to tackle assisted and autonomous driving tasks. However, most of the existing work focuses on the fully supervised setting, training networks on large annotated datasets. In this work, we present RendBEV, a new method for the self‑supervised training of BEV semantic segmentation networks, leveraging differentiable volumetric rendering to receive supervision from semantic perspective views computed by a 2D semantic segmentation model. Our method enables zero‑shot BEV semantic segmentation, and already delivers competitive results in this challenging setting. When used as pretraining to then fine‑tune on labeled BEV ground‑truth, our method significantly boosts performance in low‑annotation regimes, and sets a new state of the art when fine‑tuning on all available labels.

Abstract:
The significant effort required to annotate data for new training datasets hinders computer vision research and machine learning in the construction industry. This work explores adapting standard datasets and the latest transformer model architectures for point cloud semantic segmentation in the context of shell construction sites. Unlike common approaches focused on object segmentation of building interiors and furniture, this study addressed the challenges of segmenting complex structural components in Architecture, Engineering, and Construction (AEC). We establish a baseline through supervised training and a custom validation dataset, evaluate the cross‑domain inference with large‑scale indoor datasets, and utilize transfer learning to maximize segmentation performance with minimal new data. The findings indicate that with minimal fine‑tuning, pre‑trained transformer architectures offer an effective strategy for building component segmentation. Our results are promising for automating the annotation of new, previously unseen data when creating larger training resources and for the segmentation of frequently recurring objects.

Abstract:
Real‑time Magnetic Resonance Imaging (rtMRI) is frequently used in speech production studies as it provides a complete view of the vocal tract during articulation. This study investigates the effectiveness of rtMRI in analyzing vocal tract movements by employing the SegNet and UNet models for Air‑Tissue Boundary (ATB) segmentation tasks. We conducted pretraining of a few base models using increasing numbers of subjects and videos, to assess performance on two datasets. The first consists of unseen subjects with unseen videos from the same data source, on which we achieved results 0.33% and 0.91% (Pixel‑wise Classification Accuracy (PCA) and Dice Coefficient, respectively) better than the matched condition. The second comprises unseen videos from a new data source, on which we obtained 99.63% and 98.09% (PCA and Dice Coefficient, respectively) of the matched condition performance. Here, matched condition performance refers to the performance of a model trained only on the test subjects, which was set as a benchmark for the other models. Our findings highlight the significance of fine‑tuning and adapting models with limited data. Notably, we demonstrated that effective model adaptation can be achieved with as few as 15 rtMRI frames from any new dataset.

Abstract:
Integrating hyperspectral imagery (HSI) with deep neural networks (DNNs) can strengthen the accuracy of intelligent vision systems by combining spectral and spatial information, which is useful for tasks like semantic segmentation in autonomous driving. To advance research in such safety‑critical systems, determining the precise contribution of spectral information to complex DNNs' output is needed. To address this, several saliency methods, such as class activation maps (CAM), have been proposed primarily for image classification. However, recent studies have raised concerns regarding their reliability. In this paper, we address their limitations and propose an alternative approach by leveraging the data provided by activations and weights from relevant DNN layers to better capture the relationship between input features and predictions. The study aims to assess the superior performance of HSI compared to 3‑channel and single‑channel DNNs. We also address the influence of spectral signature normalization for enhancing DNN robustness in real‑world driving conditions.

Abstract:
Understanding the relationship between the evolution of microstructures of irradiated LiAlO2 pellets and tritium diffusion, retention and release could improve predictions of tritium‑producing burnable absorber rod performance. Given expert‑labeled segmented images of irradiated and unirradiated pellets, we trained Deep Convolutional Neural Networks to segment images into defect, grain, and boundary classes. Qualitative microstructural information was calculated from these segmented images to facilitate the comparison of unirradiated and irradiated pellets. We tested modifications to improve the sensitivity of the model, including incorporating meta‑data into the model and utilizing uncertainty quantification. The predicted segmentation was similar to the expert‑labeled segmentation for most methods of microstructural qualification, including pixel proportion, defect area, and defect density. Overall, the high performance metrics for the best models for both irradiated and unirradiated images show that utilizing neural network models is a viable alternative to expert‑labeled images.

Abstract:
Infrared sensing is a core method for supporting unmanned systems, such as autonomous vehicles and drones. Recently, infrared sensors have been widely deployed on mobile and stationary platforms for detection and classification of objects from long distances and in wide fields of view. Given its success in the visual image analysis domain, deep learning has also been applied for object recognition in infrared images. However, techniques that have proven successful in visible light perception face new challenges in the infrared domain. These challenges include extremely low signal‑to‑noise ratios in infrared images, very small and blurred objects of interest, and limited availability of labeled/unlabeled training data due to the specialized nature of infrared sensors. Numerous methods have been proposed in the literature for the detection and classification of small objects in infrared images, achieving varied levels of success. There is a need for a survey paper that critically analyzes existing techniques in this domain, identifies unsolved challenges, and provides future research directions. This paper fills the gap and offers a concise and insightful review of deep learning‑based methods. It also identifies the challenges faced by existing infrared object segmentation methods, provides a structured review of existing infrared perception methods from the perspective of these challenges, and highlights the motivations behind the various approaches. Finally, this review suggests promising future directions based on recent advancements within this domain.

Abstract:
The complexity of scenes and variations in image quality result in significant variability in the performance of semantic segmentation methods of remote sensing imagery (RSI) in supervised real‑world scenarios. This makes the evaluation of semantic segmentation quality in such scenarios an issue to be resolved. However, most of the existing evaluation metrics are developed based on expert‑labeled object‑level annotations, which are not applicable in such scenarios. To address this issue, we propose RS‑SQA, an unsupervised quality assessment model for RSI semantic segmentation based on vision language model (VLM). This framework leverages a pre‑trained RS VLM for semantic understanding and utilizes intermediate features from segmentation methods to extract implicit information about segmentation quality. Specifically, we introduce CLIP‑RS, a large‑scale pre‑trained VLM trained with purified text to reduce textual noise and capture robust semantic information in the RS domain. Feature visualizations confirm that CLIP‑RS can effectively differentiate between various levels of segmentation quality. Semantic features and low‑level segmentation features are effectively integrated through a semantic‑guided approach to enhance evaluation accuracy. To further support the development of RS semantic segmentation quality assessment, we present RS‑SQED, a dedicated dataset sampled from four major RS semantic segmentation datasets and annotated with segmentation accuracy derived from the inference results of 8 representative segmentation methods. Experimental results on the established dataset demonstrate that RS‑SQA significantly outperforms state‑of‑the‑art quality assessment models. This provides essential support for predicting segmentation accuracy and high‑quality semantic segmentation interpretation, offering substantial practical value.

Abstract:
Medical image segmentation plays a crucial role in various clinical applications. A major challenge in medical image segmentation is achieving accurate delineation of regions of interest in the presence of noise, low contrast, or complex anatomical structures. Existing segmentation models often neglect the integration of multi‑grained information and fail to preserve edge details, which are critical for precise segmentation. To address these challenges, we propose a novel image semantic segmentation model called the Multi‑Grained Feature Integration Network (MGFI‑Net). Our MGFI‑Net is designed with two dedicated modules to tackle these issues. First, to enhance segmentation accuracy, we introduce a Multi‑Grained Feature Extraction Module, which leverages hierarchical relationships between different feature scales to selectively focus on the most relevant information. Second, to preserve edge details, we incorporate an Edge Enhancement Module that effectively retains and integrates boundary information to refine segmentation results. Extensive experiments demonstrate that MGFI‑Net not only outperforms state‑of‑the‑art methods in terms of segmentation accuracy but also achieves superior time efficiency, establishing it as a leading solution for real‑time medical image segmentation.

Abstract:
Ensuring the safety and reliability of power grids is critical as global energy demands continue to rise. Traditional inspection methods, such as manual observations or helicopter surveys, are resource‑intensive and lack scalability. This paper explores the use of 3D computer vision to automate power grid inspections, utilizing the TS40K dataset, a high‑density, annotated collection of 3D LiDAR point clouds. By concentrating on 3D semantic segmentation, our approach addresses challenges like class imbalance and noisy data to enhance the detection of critical grid components such as power lines and towers. The benchmark results indicate significant performance improvements, with IoU scores reaching 95.53% for the detection of power lines using transformer‑based models. Our findings illustrate the potential for integrating ML into grid maintenance workflows, increasing efficiency and enabling proactive risk management strategies.

Abstract:
Open‑vocabulary semantic segmentation enables models to identify novel object categories beyond their training data. While this flexibility represents a significant advancement, current approaches still rely on manually specified class names as input, creating an inherent bottleneck in real‑world applications. This work proposes a Vocabulary‑Free Semantic Segmentation pipeline, eliminating the need for predefined class vocabularies. Specifically, we address the chicken‑and‑egg problem where users need knowledge of all potential objects within a scene to identify them, yet the purpose of segmentation is often to discover these objects. The proposed approach leverages Vision‑Language Models to automatically recognize objects and generate appropriate class names, aiming to solve the challenge of class specification and naming quality. Through extensive experiments on several public datasets, we highlight the crucial role of the text encoder in model performance, particularly when the image text classes are paired with generated descriptions. Despite the challenges introduced by the sensitivity of the segmentation text encoder to false negatives within the class tagging process, which adds complexity to the task, we demonstrate that our fully automated pipeline significantly enhances vocabulary‑free segmentation accuracy across diverse real‑world scenarios.

Abstract:
A robust and efficient traffic monitoring system is essential for smart cities and Intelligent Transportation Systems (ITS), using sensors and cameras to track vehicle movements, optimize traffic flow, reduce congestion, enhance road safety, and enable real‑time adaptive traffic control. Traffic monitoring models must comprehensively understand dynamic urban conditions and provide an intuitive user interface for effective management. This research leverages the LLaVA visual grounding multimodal large language model (LLM) for traffic monitoring tasks on the real‑time Quanser Interactive Lab simulation platform, covering scenarios like intersections, congestion, and collisions. Cameras placed at multiple urban locations collect real‑time images from the simulation, which are fed into the LLaVA model with queries for analysis. An instance segmentation model integrated into the cameras highlights key elements such as vehicles and pedestrians, enhancing training and throughput. The system achieves 84.3% accuracy in recognizing vehicle locations and 76.4% in determining steering direction, outperforming traditional models.

Abstract:
Referring Medical Image Sequence Segmentation (Ref‑MISS) is a novel and challenging task that aims to segment anatomical structures in medical image sequences (e.g., endoscopy, ultrasound, CT, and MRI) based on natural language descriptions. This task holds significant clinical potential and offers a user‑friendly advancement in medical imaging interpretation. Existing 2D and 3D segmentation models struggle to explicitly track objects of interest across medical image sequences, and lack support for interactive, text‑driven guidance. To address these limitations, we propose Text‑Promptable Propagation (TPP), a model designed for referring medical image sequence segmentation. TPP captures the intrinsic relationships among sequential images along with their associated textual descriptions. Specifically, it enables the recognition of referred objects through cross‑modal referring interaction, and maintains continuous tracking across the sequence via Transformer‑based triple propagation, using text embeddings as queries. To support this task, we curate a large‑scale benchmark, Ref‑MISS‑Bench, which covers 4 imaging modalities and 20 different organs and lesions. Experimental results on this benchmark demonstrate that TPP consistently outperforms state‑of‑the‑art methods in both medical segmentation and referring video object segmentation.

Abstract:
Finding the cadastral boundaries of farmlands is a crucial concern for land administration. Therefore, using deep learning methods to expedite and simplify the extraction of cadastral boundaries from satellite and unmanned aerial vehicle (UAV) images is critical. In this paper, we employ transfer learning to train a U‑Net model with a ResNet34 backbone to detect cadastral boundaries through three‑class semantic segmentation: "boundary", "field", and "background". We evaluate the performance on two satellite images from farmlands in Iran using "precision", "recall", and "F‑score", achieving high values of 88%, 75%, and 81%, respectively, which indicate promising results.
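
The three reported metrics can be checked against each other. The sketch below assumes the "F‑score" is the balanced F1 score (the harmonic mean of precision and recall); the abstract does not state which variant was used, and the function name `f_score` is illustrative only.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when both metrics are zero
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported in the abstract (88% and 75%).
f1 = f_score(0.88, 0.75)
print(f"F1 = {f1:.3f}")
```

Under that assumption the result is roughly 0.81, which is consistent with the reported 81%.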

Abstract:
Semantic segmentation is an important task for autonomous driving. A powerful autonomous driving system should be capable of handling images under all conditions, including nighttime. Generating accurate and diverse nighttime semantic segmentation datasets is crucial for enhancing the performance of computer vision algorithms in low‑light conditions. In this thesis, we introduce a novel approach named NPSim, which enables the simulation of realistic nighttime images from real daytime counterparts with monocular inverse rendering and ray tracing. NPSim comprises two key components: mesh reconstruction and relighting. The mesh reconstruction component generates an accurate representation of the scene structure by combining geometric information extracted from the input RGB image and semantic information from its corresponding semantic labels. The relighting component integrates real‑world nighttime light sources and material characteristics to simulate the complex interplay of light and object surfaces under low‑light conditions. The scope of this thesis mainly focuses on the implementation and evaluation of the mesh reconstruction component. Through experiments, we demonstrate the effectiveness of the mesh reconstruction component in producing high‑quality scene meshes and their generality across different autonomous driving datasets. We also propose a detailed experiment plan for evaluating the entire pipeline, including both quantitative metrics in training state‑of‑the‑art supervised and unsupervised semantic segmentation approaches and human perceptual studies, aiming to indicate the capability of our approach to generate realistic nighttime images and the value of our dataset in steering future progress in the field.

Abstract:
Dental panoramic radiographs (DPRs) are widely used in clinical practice for comprehensive oral assessment but present challenges due to overlapping structures and time constraints in interpretation. This study aimed to establish a solid baseline for the AI‑automated assessment of findings in DPRs by developing and evaluating an AI system and comparing its performance with that of human readers across multinational data sets. We analyzed 6,669 DPRs from three data sets (the Netherlands, Brazil, and Taiwan), focusing on 8 types of dental findings. The AI system combined object detection and semantic segmentation techniques for per‑tooth finding identification. Performance metrics included sensitivity, specificity, and area under the receiver operating characteristic curve (AUC‑ROC). AI generalizability was tested across data sets, and performance was compared with human dental practitioners. The AI system demonstrated comparable or superior performance to human readers, particularly +67.9% (95% CI: 54.0%‑81.9%; p < .001) sensitivity for identifying periapical radiolucencies and +4.7% (95% CI: 1.4%‑8.0%; p = .008) sensitivity for identifying missing teeth. The AI achieved a macro‑averaged AUC‑ROC of 96.2% (95% CI: 94.6%‑97.8%) across 8 findings. AI agreements with the reference were comparable to inter‑human agreements in 7 of 8 findings, the exception being caries (p = .024). The AI system demonstrated robust generalization across diverse imaging and demographic settings and processed images 79 times faster (95% CI: 75‑82) than human readers. The AI system effectively assessed findings in DPRs, achieving performance on par with or better than human experts while significantly reducing interpretation time. These results highlight the potential for integrating AI into clinical workflows to improve diagnostic efficiency, accuracy, and patient management.

Abstract:
Purpose: Foundation models, trained on multitudes of public datasets, often require additional fine‑tuning or re‑prompting mechanisms to be applied to visually distinct target domains such as surgical videos. Further, without domain knowledge, they cannot model the specific semantics of the target domain. Hence, when applied to surgical video segmentation, they fail to generalise to sections where previously tracked objects leave the scene or new objects enter. Methods: We propose SASVi, a novel re‑prompting mechanism based on a frame‑wise Mask R‑CNN Overseer model, which is trained on a minimal amount of scarcely available annotations for the target domain. This model automatically re‑prompts the foundation model SAM2 when the scene constellation changes, allowing for temporally smooth and complete segmentation of full surgical videos. Results: Re‑prompting based on our Overseer model significantly improves the temporal consistency of surgical video segmentation compared to similar prompting techniques and especially frame‑wise segmentation, which neglects temporal information, by at least 1.5%. Our proposed approach allows us to successfully deploy SAM2 to surgical videos, which we quantitatively and qualitatively demonstrate for three different cholecystectomy and cataract surgery datasets. Conclusion: SASVi can serve as a new baseline for smooth and temporally consistent segmentation of surgical videos with scarcely available annotation data. Our method allows us to leverage scarce annotations and obtain complete annotations for full videos of the large‑scale counterpart datasets. We make those annotations publicly available, providing extensive annotation data for the future development of surgical data science models.

Abstract:
Sketch segmentation involves grouping pixels within a sketch that belong to the same object or instance. It serves as a valuable tool for sketch editing tasks, such as moving, scaling, or removing specific components. While image segmentation models have demonstrated remarkable capabilities in recent years, sketches present unique challenges for these models due to their sparse nature and wide variation in styles. We introduce InkLayer, a method for instance segmentation of raster scene sketches. Our approach adapts state‑of‑the‑art image segmentation and object detection models to the sketch domain by employing class‑agnostic fine‑tuning and refining segmentation masks using depth cues. Furthermore, our method organizes sketches into sorted layers, where occluded instances are inpainted, enabling advanced sketch editing applications. As existing datasets in this domain lack variation in sketch styles, we construct a synthetic scene sketch segmentation dataset, InkScenes, featuring sketches with diverse brush strokes and varying levels of detail. We use this dataset to demonstrate the robustness of our approach.

Abstract:
This work introduces Semantically Masked Vector Quantized Generative Adversarial Network (SQ‑GAN), a novel approach integrating semantically driven image coding and vector quantization to optimize image compression for semantic/task‑oriented communications. The method only acts on source coding and is fully compliant with legacy systems. The semantics are extracted from the image by computing its semantic segmentation map using off‑the‑shelf software. A new, specifically developed semantic‑conditioned adaptive mask module (SAMM) selectively encodes semantically relevant features of the image. The relevance of the different semantic classes is task‑specific, and it is incorporated in the training phase by introducing appropriate weights in the loss function. SQ‑GAN outperforms state‑of‑the‑art image compression schemes such as JPEG2000, BPG, and deep‑learning based methods across multiple metrics, including perceptual quality and semantic segmentation accuracy on the reconstructed image, at extremely low compression rates.

Abstract:
3D scene understanding is a critical yet challenging task in autonomous driving due to the irregularity and sparsity of LiDAR data, as well as the computational demands of processing large‑scale point clouds. Recent methods leverage range‑view representations to enhance efficiency, but they often adopt higher azimuth resolutions to mitigate information loss during spherical projection, where only the closest point is retained for each 2D grid. However, processing wide panoramic range‑view images remains inefficient and may introduce additional distortions. Our empirical analysis shows that training with multiple range images, obtained from splitting the full point cloud, improves both segmentation accuracy and computational efficiency. However, this approach also poses new challenges of exacerbated class imbalance and increase in projection artifacts. To address these, we introduce FLARES, a novel training paradigm that incorporates two tailored data augmentation techniques and a specialized post‑processing method designed for multi‑range settings. Extensive experiments demonstrate that FLARES is highly generalizable across different architectures, yielding 2.1%~7.9% mIoU improvements on SemanticKITTI and 1.8%~3.9% mIoU on nuScenes, while delivering over 40% speed‑up in inference.
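
The spherical projection mentioned above, in which only the closest point survives in each 2D cell, can be sketched as follows. This is a generic range‑view projection under assumed conventions, not the FLARES pipeline: the image size, the vertical field of view, and the name `range_projection` are all illustrative choices.

```python
import math


def range_projection(points, height=4, width=8, fov_up_deg=15.0, fov_down_deg=-15.0):
    """Project (x, y, z) points onto an H x W range image, keeping the closest point per cell."""
    fov_up = math.radians(fov_up_deg)
    fov_down = math.radians(fov_down_deg)
    fov = fov_up - fov_down
    image = [[None] * width for _ in range(height)]  # None marks an empty cell
    for x, y, z in points:
        r = math.sqrt(x * x + y * y + z * z)
        if r == 0.0:
            continue  # a point at the sensor origin has no direction
        yaw = math.atan2(y, x)    # azimuth angle
        pitch = math.asin(z / r)  # elevation angle
        u = int(0.5 * (1.0 - yaw / math.pi) * width)        # column from azimuth
        v = int((1.0 - (pitch - fov_down) / fov) * height)  # row from elevation
        u = min(max(u, 0), width - 1)
        v = min(max(v, 0), height - 1)
        # Several points can land in one cell; only the nearest range is kept.
        if image[v][u] is None or r < image[v][u]:
            image[v][u] = r
    return image


# Two points along the same ray and one point in a different direction.
pts = [(10.0, 0.0, 0.0), (5.0, 0.0, 0.0), (0.0, 3.0, 0.0)]
img = range_projection(pts)
occupied = sum(cell is not None for row in img for cell in row)
print(occupied, "occupied cells out of", len(img) * len(img[0]))
```

The two collinear points fall into the same cell, so the farther one is discarded. This is the information loss that higher azimuth resolutions are meant to mitigate.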

Abstract:
Medical images are often high‑resolution and lose important detail if downsampled, making pixel‑level methods such as semantic segmentation much less efficient if performed on a low‑dimensional image. We propose a low‑rank Matryoshka projection and a hybrid segmenting architecture that preserves important information while retaining sufficient pixel geometry for pixel‑level tasks. We design the Matryoshka Autoencoder U‑Net (MatAE‑U‑Net), which combines the hierarchical encoding of the Matryoshka Autoencoder with the spatial reconstruction capabilities of a U‑Net decoder, leveraging multi‑scale feature extraction and skip connections to enhance accuracy and generalisation. We apply it to the problem of segmenting the left ventricle (LV) in echocardiographic images using the Stanford EchoNet‑D dataset, including 1,000 standardised video‑mask pairs of cardiac ultrasound videos resized to 112x112 pixels. The MatAE‑U‑Net model achieves a Mean IoU of 77.68%, Mean Pixel Accuracy of 97.46%, and Dice Coefficient of 86.91%, outperforming the baseline U‑Net, which attains a Mean IoU of 74.70%, Mean Pixel Accuracy of 97.31%, and Dice Coefficient of 85.20%. The results highlight the potential of using the U‑Net in the recursive Matryoshka latent space for imaging problems with low contrast, such as echocardiographic analysis.
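
For reference, IoU and the Dice coefficient for a single pair of binary masks can be computed as in the sketch below. The toy 3x3 masks and the helper name `iou_and_dice` are illustrative and do not come from the paper; how the paper averages these metrics over frames or classes is not stated in the abstract.

```python
def iou_and_dice(pred, target):
    """IoU and Dice for two binary masks given as equally sized nested lists of 0/1."""
    inter = sum(p & t for p_row, t_row in zip(pred, target) for p, t in zip(p_row, t_row))
    pred_area = sum(map(sum, pred))
    target_area = sum(map(sum, target))
    union = pred_area + target_area - inter
    if union == 0:
        return 1.0, 1.0  # both masks empty: treated as perfect agreement (a convention)
    return inter / union, 2 * inter / (pred_area + target_area)


pred = [[0, 1, 1],
        [0, 1, 1],
        [0, 0, 0]]
target = [[0, 1, 1],
          [0, 1, 0],
          [0, 1, 0]]
iou, dice = iou_and_dice(pred, target)
print(f"IoU = {iou:.2f}, Dice = {dice:.2f}")
```

For one mask pair the two metrics are tied by Dice = 2·IoU / (1 + IoU). Metrics averaged over many samples, such as those reported above, need not satisfy this identity exactly.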

Abstract:
Precise segmentation and classification of cell instances are vital for analyzing the tissue microenvironment in histology images, supporting medical diagnosis, prognosis, treatment planning, and studies of brain cytoarchitecture. However, the creation of high‑quality annotated datasets for training remains a major challenge. This study introduces a novel single‑stage approach (HistoSmith) for generating image‑label pairs to augment histology datasets. Unlike state‑of‑the‑art methods that utilize diffusion models with separate components for label and image generation, our approach employs a latent diffusion model to learn the joint distribution of cellular layouts, classification masks, and histology images. This model enables tailored data generation by conditioning on user‑defined parameters such as cell types, quantities, and tissue types. Trained on the Conic H&E histopathology dataset and the Nissl‑stained CytoDArk0 dataset, the model generates realistic and diverse labeled samples. Experimental results demonstrate improvements in cell instance segmentation and classification, particularly for underrepresented cell types like neutrophils in the Conic dataset. These findings underscore the potential of our approach to address data scarcity challenges.

Abstract:
This work addresses the task of generalized class discovery (GCD) in instance segmentation. The goal is to discover novel classes and obtain a model capable of segmenting instances of both known and novel categories, given labeled and unlabeled data. Since the real world contains numerous objects with long‑tailed distributions, the instance distribution for each class is inherently imbalanced. To address the imbalanced distributions, we propose an instance‑wise temperature assignment (ITA) method for contrastive learning and class‑wise reliability criteria for pseudo‑labels. The ITA method relaxes instance discrimination for samples belonging to head classes to enhance GCD. The reliability criteria are to avoid excluding most pseudo‑labels for tail classes when training an instance segmentation network using pseudo‑labels from GCD. Additionally, we propose dynamically adjusting the criteria to leverage diverse samples in the early stages while relying only on reliable pseudo‑labels in the later stages. We also introduce an efficient soft attention module to encode object‑specific representations for GCD. Finally, we evaluate our proposed method by conducting experiments on two settings: COCO_half + LVIS and LVIS + Visual Genome. The experimental results demonstrate that the proposed method outperforms previous state‑of‑the‑art methods.

Abstract:
Extending the translation equivariance property of convolutional neural networks to larger symmetry groups has been shown to reduce sample complexity and enable more discriminative feature learning. Further, exploiting additional symmetries facilitates greater weight sharing than standard convolutions, leading to an enhanced network expressivity without an increase in parameter count. However, extending the equivariant properties of a convolution layer comes at a computational cost. In particular, for 3D data, expanding equivariance to the SE(3) group (rotation and translation) results in a 6D convolution operation, which is not tractable for larger data samples such as 3D scene scans. While efforts have been made to develop efficient SE(3) equivariant networks, existing approaches rely on discretization or only introduce global rotation equivariance. This limits their applicability to point clouds representing a scene composed of multiple objects. This work presents an efficient, continuous, and local SE(3) equivariant convolution layer for point cloud processing based on general group convolution and local reference frames. Our experiments show that our approach achieves competitive or superior performance across a range of datasets and tasks, including object classification and semantic segmentation, with negligible computational overhead.

Abstract:
Deep Neural Networks (DNNs) can be catastrophically disrupted by flipping only a handful of parameter bits. We introduce Deep Neural Lesion (DNL), a data‑free and optimization‑free method that locates critical parameters, and an enhanced single‑pass variant, 1P‑DNL, that refines this selection with one forward and backward pass on random inputs. We show that this vulnerability spans multiple domains, including image classification, object detection, instance segmentation, and reasoning large language models. In image classification, flipping just two sign bits in ResNet‑50 on ImageNet reduces accuracy by 99.8%. In object detection and instance segmentation, one or two sign flips in the backbone collapse COCO detection and mask AP for Mask R‑CNN and YOLOv8‑seg models. In language modeling, two sign flips into different experts reduce Qwen3‑30B‑A3B‑Thinking from 78% to 0% accuracy. We also show that selectively protecting a small fraction of vulnerable sign bits provides a practical defense against such attacks.
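
To make "flipping a sign bit" concrete, the sketch below toggles bit 31 of a value stored as an IEEE 754 float32 and shows the effect on a toy dot product. It illustrates only the bit‑level operation; how DNL selects which parameters to attack is not reproduced here, and the weights and inputs are made‑up values.

```python
import struct


def flip_sign_bit(x: float) -> float:
    """Flip bit 31 (the sign bit) of x viewed as an IEEE 754 float32."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return struct.unpack("<f", struct.pack("<I", bits ^ 0x80000000))[0]


weights = [0.5, -1.25, 3.0]  # hypothetical parameters, exactly representable in float32
inputs = [1.0, 1.0, 1.0]

before = sum(w * x for w, x in zip(weights, inputs))

# Flip the sign of the largest-magnitude weight only.
attacked = list(weights)
attacked[2] = flip_sign_bit(attacked[2])
after = sum(w * x for w, x in zip(attacked, inputs))

print(before, after)  # 2.25 -3.75
```

A single flipped bit changes the output from 2.25 to -3.75 in this toy case, because the magnitude of the weight is preserved while its contribution is inverted.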

Abstract:
Transformers have become foundational for visual tasks such as object detection, semantic segmentation, and video understanding, but their quadratic complexity in attention mechanisms presents scalability challenges. To address these limitations, the Mamba architecture utilizes state‑space models (SSMs) for linear scalability, efficient processing, and improved contextual awareness. This paper investigates Mamba architecture for visual domain applications and its recent advancements, including Vision Mamba (ViM) and VideoMamba, which introduce bidirectional scanning, selective scanning mechanisms, and spatiotemporal processing to enhance image and video understanding. Architectural innovations like position embeddings, cross‑scan modules, and hierarchical designs further optimize the Mamba framework for global and local feature extraction. These advancements position Mamba as a promising architecture in computer vision research and applications.

Abstract:
Medical images often exhibit low and blurred contrast between lesions and surrounding tissues, with considerable variation in lesion edges and shapes even within the same disease, leading to significant challenges in segmentation. Therefore, precise segmentation of lesions has become an essential prerequisite for patient condition assessment and formulation of treatment plans. Significant achievements have been made in research related to the U‑Net model in recent years. It improves segmentation performance and is extensively applied in the semantic segmentation of medical images to offer technical support for consistent quantitative lesion analysis methods. First, this paper classifies medical image datasets on the basis of their imaging modalities and then examines U‑Net and its various improvement models from the perspective of structural modifications. The research objectives, innovative designs, and limitations of each approach are discussed in detail. Second, we summarize the four central improvement mechanisms of the U‑Net and U‑Net variant algorithms: the jump‑connection mechanism, residual‑connection mechanism, 3D‑UNet, and transformer mechanism. Finally, we examine the relationships among the four core enhancement mechanisms and commonly utilized medical datasets and propose potential avenues and strategies for future advancements. This paper provides a systematic summary and reference for researchers in related fields, and we look forward to designing more efficient and stable medical image segmentation network models based on the U‑Net network.

Abstract:
Recent works modify CLIP to perform open‑vocabulary semantic segmentation in a training‑free manner (TF‑OVSS). In vanilla CLIP, patch‑wise image representations mainly encode homogeneous image‑level properties, which hinders the application of CLIP to the dense prediction task. Previous TF‑OVSS works sacrifice globality to enhance the locality of CLIP features, by making each patch mainly attend to itself or its neighboring patches within a narrow local window. With their modifications, the ability of CLIP to aggregate global context information is largely weakened. In contrast, in this paper we rethink the global knowledge encoded by CLIP and propose GCLIP to answer how to extract and utilize beneficial global knowledge of CLIP for TF‑OVSS. As the representation of each patch is finally determined by the attention weights and the Value embeddings, we propose to reshape the last‑block attention and Value embeddings to aggregate useful global context into final features. Firstly, we aim to equip the last‑block attention with image‑level properties while not introducing homogeneous attention patterns across patches. To realize this goal, we fuse the attention from the global‑token emerging blocks with the Query‑Query attention. Secondly, we aim to make Value embeddings of the last‑block attention module more semantically correlated. To realize this, we design a novel channel suppression strategy. Extensive experiments on five standard benchmarks demonstrate that our method consistently outperforms previous state‑of‑the‑art methods.

Abstract:
Active vision enables dynamic visual perception, offering an alternative to static feedforward architectures in computer vision, which rely on large datasets and high computational resources. Biological selective attention mechanisms allow agents to focus on salient Regions of Interest (ROIs), reducing computational demand while maintaining real‑time responsiveness. Event‑based cameras, inspired by the mammalian retina, enhance this capability by capturing asynchronous scene changes, enabling efficient low‑latency processing. To distinguish moving objects while the event‑based camera is in motion, the agent requires an object motion segmentation mechanism to accurately detect targets and center them in the visual field (fovea). Integrating event‑based sensors with neuromorphic algorithms represents a paradigm shift, using Spiking Neural Networks to parallelize computation and adapt to dynamic environments. This work presents a bioinspired Spiking Convolutional Neural Network attention system for selective attention through object motion sensitivity. The system generates events via fixational eye movements using a Dynamic Vision Sensor integrated into the Speck neuromorphic hardware, mounted on a Pan‑Tilt unit, to identify the ROI and saccade toward it. The system, characterized using ideal gratings and benchmarked against the Event Camera Motion Segmentation Dataset, reaches a mean IoU of 82.2% and a mean SSIM of 96% in multi‑object motion segmentation. The detection of salient objects reaches 88.8% accuracy in office scenarios and 89.8% in low‑light conditions on the Event‑Assisted Low‑Light Video Object Segmentation Dataset. A real‑time demonstrator shows the system's 0.12 s response to dynamic scenes. Its learning‑free design ensures robustness across perceptual scenes, making it a reliable foundation for real‑time robotic applications and a basis for more complex architectures.

Abstract:
The recent advancements in generative AI techniques, which have significantly increased the online dissemination of altered images and videos, have raised serious concerns about the credibility of digital media available on the Internet and distributed through information channels and social networks. This issue particularly affects domains that rely heavily on trustworthy data, such as journalism, forensic analysis, and Earth observation. To address these concerns, the ability to geolocate a non‑geo‑tagged ground‑view image without external information, such as GPS coordinates, has become increasingly critical. This study tackles the challenge of linking a ground‑view image, potentially exhibiting varying fields of view (FoV), to its corresponding satellite image without the aid of GPS data. To achieve this, we propose a novel four‑stream Siamese‑like architecture, the Quadruple Semantic Align Net (SAN‑QUAD), which extends previous state‑of‑the‑art (SOTA) approaches by leveraging semantic segmentation applied to both ground and satellite imagery. Experimental results on a subset of the CVUSA dataset demonstrate significant improvements of up to 9.8% over prior methods across various FoV settings.

Abstract:
Point clouds captured with laser scanning systems from forest environments can be utilized in a wide variety of applications within forestry and plant ecology, such as the estimation of tree stem attributes, leaf angle distribution, and above‑ground biomass. However, effectively utilizing the data in such tasks requires the semantic segmentation of the data into wood and foliage points, also known as leaf‑wood separation. The traditional approach to leaf‑wood separation has been geometry‑ and radiometry‑based unsupervised algorithms, which tend to perform poorly on data captured with airborne laser scanning (ALS) systems, even with a high point density. While recent machine and deep learning approaches achieve great results even on sparse point clouds, they require manually labeled training data, which is often extremely laborious to produce. Multispectral (MS) information has been demonstrated to have potential for improving the accuracy of leaf‑wood separation, but quantitative assessment of its effects has been lacking. This study proposes a fully unsupervised deep learning method, GrowSP‑ForMS, which is specifically designed for leaf‑wood separation of high‑density MS ALS point clouds and based on the GrowSP architecture. GrowSP‑ForMS achieved a mean accuracy of 84.3% and a mean intersection over union (mIoU) of 69.6% on our MS test set, outperforming the unsupervised reference methods by a significant margin. When compared to supervised deep learning methods, our model performed similarly to the slightly older PointNet architecture but was outclassed by more recent approaches. Finally, two ablation studies were conducted, which demonstrated that our proposed changes increased the test set mIoU of GrowSP‑ForMS by 29.4 percentage points (pp) in comparison to the original GrowSP model and that utilizing MS data improved the mIoU by 5.6 pp from the monospectral case.

Abstract:
Traveling waves of neural activity are widely observed in the brain, but their precise computational function remains unclear. One prominent hypothesis is that they enable the transfer and integration of spatial information across neural populations. However, few computational models have explored how traveling waves might be harnessed to perform such integrative processing. Drawing inspiration from the famous "Can one hear the shape of a drum?" problem ‑‑ which highlights how normal modes of wave dynamics encode geometric information ‑‑ we investigate whether similar principles can be leveraged in artificial neural networks. Specifically, we introduce convolutional recurrent neural networks that learn to produce traveling waves in their hidden states in response to visual stimuli, enabling spatial integration. By then treating these wave‑like activation sequences as visual representations themselves, we obtain a powerful representational space that outperforms local feed‑forward networks on tasks requiring global spatial context. In particular, we observe that traveling waves effectively expand the receptive field of locally connected neurons, supporting long‑range encoding and communication of information. We demonstrate that models equipped with this mechanism solve visual semantic segmentation tasks demanding global integration, significantly outperforming local feed‑forward models and rivaling non‑local U‑Net models with fewer parameters. As a first step toward traveling‑wave‑based communication and visual representation in artificial networks, our findings suggest wave‑dynamics may provide efficiency and training stability benefits, while simultaneously offering a new framework for connecting models to biological recordings of neural activity.

Abstract:
Crafting magic and illusions is one of the most thrilling aspects of filmmaking, with visual effects (VFX) serving as the powerhouse behind unforgettable cinematic experiences. While recent advances in generative artificial intelligence have driven progress in generic image and video synthesis, the domain of controllable VFX generation remains relatively underexplored. In this work, we propose a novel paradigm for animated VFX generation as image animation, where dynamic effects are generated from user‑friendly textual descriptions and static reference images. Our work makes two primary contributions: (i) Open‑VFX, the first high‑quality VFX video dataset spanning 15 diverse effect categories, annotated with textual descriptions, instance segmentation masks for spatial conditioning, and start‑end timestamps for temporal control. (ii) VFX Creator, a simple yet effective controllable VFX generation framework based on a Video Diffusion Transformer. The model incorporates a spatial and temporal controllable LoRA adapter, requiring minimal training videos. Specifically, a plug‑and‑play mask control module enables instance‑level spatial manipulation, while tokenized start‑end motion timestamps embedded in the diffusion process, alongside the text encoder, allow precise temporal control over effect timing and pace. Extensive experiments on the Open‑VFX test set demonstrate the superiority of the proposed system in generating realistic and dynamic effects, achieving state‑of‑the‑art performance and generalization ability in both spatial and temporal controllability. Furthermore, we introduce a specialized metric to evaluate the precision of temporal control. By bridging traditional VFX techniques with generative approaches, VFX Creator unlocks new possibilities for efficient and high‑quality video effect generation, making advanced VFX accessible to a broader audience.

Abstract:
In this paper, we address the task of semantic segmentation of legal documents through rhetorical role classification, with a focus on Indian legal judgments. We introduce LegalSeg, the largest annotated dataset for this task, comprising over 7,000 documents and 1.4 million sentences, labeled with 7 rhetorical roles. To benchmark performance, we evaluate multiple state‑of‑the‑art models, including Hierarchical BiLSTM‑CRF, TransformerOverInLegalBERT (ToInLegalBERT), Graph Neural Networks (GNNs), and Role‑Aware Transformers, alongside an exploratory RhetoricLLaMA, an instruction‑tuned large language model. Our results demonstrate that models incorporating broader context, structural relationships, and sequential sentence information outperform those relying solely on sentence‑level features. Additionally, we conducted experiments using surrounding context and predicted or actual labels of neighboring sentences to assess their impact on classification accuracy. Despite these advancements, challenges persist in distinguishing between closely related roles and addressing class imbalance. Our work underscores the potential of advanced techniques for improving legal document understanding and sets a strong foundation for future research in legal NLP.

Abstract:
This study demonstrates a novel use of the U‑Net architecture in the field of semantic segmentation to detect landforms using preprocessed satellite imagery. The study applies the U‑Net model for effective feature extraction by using Convolutional Neural Network (CNN) segmentation techniques. Dropout is strategically used for regularization to improve the model's robustness, and the Adam optimizer is used for effective training. The study thoroughly assesses the performance of the U‑Net architecture utilizing a large sample of preprocessed satellite topographical images. The model excels in semantic segmentation tasks, displaying high‑resolution outputs, quick feature extraction, and flexibility to a wide range of applications. The findings highlight the U‑Net architecture's substantial contribution to the advancement of machine learning and image processing technologies. The U‑Net approach, which emphasizes pixel‑wise categorization and comprehensive segmentation map production, is helpful in practical applications such as autonomous driving, disaster management, and land use planning. This study not only investigates the complexities of U‑Net architecture for semantic segmentation, but also highlights its real‑world applications in image classification, analysis, and landform identification. The study demonstrates the U‑Net model's key significance in influencing the environment of modern technology.

Abstract:
Segmentation of 3D medical images is a critical task for accurate diagnosis and treatment planning. Convolutional neural networks (CNNs) have dominated the field, achieving significant success in 3D medical image segmentation. However, CNNs struggle with capturing long‑range dependencies and global context, limiting their performance, particularly for fine and complex structures. Recent transformer‑based models, such as TransUNet and nnFormer, have demonstrated promise in addressing these limitations, though they still rely on hybrid CNN‑transformer architectures. This paper introduces a novel, fully convolution‑free model based on transformer architecture and self‑attention mechanisms for 3D medical image segmentation. Our approach focuses on improving multi‑semantic segmentation accuracy and addressing domain adaptation challenges between thick and thin slice CT images. We propose a joint loss function that facilitates effective segmentation of thin slices based on thick slice annotations, overcoming limitations in dataset availability. Furthermore, we present a benchmark dataset for multi‑semantic segmentation on thin slices, addressing a gap in current medical imaging research. Our experiments demonstrate the superiority of the proposed model over traditional and hybrid architectures, offering new insights into the future of convolution‑free medical image segmentation.

Abstract:
Class incremental learning aims to enable models to learn from sequential, non‑stationary data streams across different tasks without catastrophic forgetting. In class incremental semantic segmentation (CISS), the semantic content of image pixels evolves over incremental phases, known as semantic drift. In this work, we identify two critical challenges in CISS that contribute to semantic drift and degrade performance. First, we highlight the issue of separate optimization, where different parts of the model are optimized in distinct incremental stages, leading to misaligned probability scales. Second, we identify noisy semantics arising from inappropriate pseudo‑labeling, which results in sub‑optimal results. To address these challenges, we propose a novel and effective approach, Image Posterior and Semantics Decoupling for Segmentation (IPSeg). IPSeg introduces two key mechanisms: (1) leveraging image posterior probabilities to align optimization across stages and mitigate the effects of separate optimization, and (2) employing semantics decoupling to handle noisy semantics and tailor learning strategies for different semantics. Extensive experiments on the Pascal VOC 2012 and ADE20K datasets demonstrate that IPSeg achieves superior performance compared to state‑of‑the‑art methods, particularly in challenging long‑term incremental scenarios.

Abstract:
Post‑training quantization (PTQ) has emerged as a promising solution for reducing the storage and computational cost of vision transformers (ViTs). Recent advances primarily target crafting quantizers to deal with the peculiar activations characteristic of ViTs. However, most existing methods underestimate the information loss incurred by weight quantization, resulting in significant performance deterioration, particularly in low‑bit cases. Furthermore, a common practice in quantizing post‑Softmax activations of ViTs is to employ logarithmic transformations, which unfortunately prioritize less informative values around zero. This approach introduces additional redundancies, ultimately leading to suboptimal quantization efficacy. To address these issues, this paper proposes an innovative PTQ method tailored for ViTs, termed AIQViT (Architecture‑Informed Post‑training Quantization for ViTs). First, we design an architecture‑informed low‑rank compensation mechanism, wherein learnable low‑rank weights are introduced to compensate for the degradation caused by weight quantization. Second, we design a dynamic focusing quantizer to accommodate the unbalanced distribution of post‑Softmax activations, which dynamically selects the most valuable interval for higher quantization resolution. Extensive experiments on five vision tasks, including image classification, object detection, instance segmentation, point cloud classification, and point cloud part segmentation, demonstrate the superiority of AIQViT over state‑of‑the‑art PTQ methods.

Abstract:
Ensuring the robustness of deep learning models requires comprehensive and diverse testing. Existing approaches, often based on simple data augmentation techniques or generative adversarial networks, are limited in producing realistic and varied test cases. To address these limitations, we present a novel framework for testing vision neural networks that leverages Large Language Models and control‑conditioned Diffusion Models to generate synthetic, high‑fidelity test cases. Our approach begins by translating images into detailed textual descriptions using a captioning model, allowing the language model to identify modifiable aspects of the image and generate counterfactual descriptions. These descriptions are then used to produce new test images through a text‑to‑image diffusion process that preserves spatial consistency and maintains the critical elements of the scene. We demonstrate the effectiveness of our method using two datasets: ImageNet1K for image classification and SHIFT for semantic segmentation in autonomous driving. The results show that our approach can generate significant test cases that reveal weaknesses and improve the robustness of the model through targeted retraining. We conducted a human assessment using Mechanical Turk to validate the generated images. The responses from the participants confirmed, with high agreement among the voters, that our approach produces valid and realistic images.

Abstract:
We present a validation dataset of newly‑collected kitchen‑based egocentric videos, manually annotated with highly detailed and interconnected ground‑truth labels covering: recipe steps, fine‑grained actions, ingredients with nutritional values, moving objects, and audio annotations. Importantly, all annotations are grounded in 3D through digital twinning of the scene, fixtures, object locations, and primed with gaze. Footage is collected from unscripted recordings in diverse home environments, making HD‑EPIC the first dataset collected in‑the‑wild but with detailed annotations matching those in controlled lab environments. We show the potential of our highly‑detailed annotations through a challenging VQA benchmark of 26K questions assessing the capability to recognise recipes, ingredients, nutrition, fine‑grained actions, 3D perception, object motion, and gaze direction. The powerful long‑context Gemini Pro only achieves 38.5% on this benchmark, showcasing its difficulty and highlighting shortcomings in current VLMs. We additionally assess action recognition, sound recognition, and long‑term video‑object segmentation on HD‑EPIC. HD‑EPIC is 41 hours of video in 9 kitchens with digital twins of 413 kitchen fixtures, capturing 69 recipes, 59K fine‑grained actions, 51K audio events, 20K object movements and 37K object masks lifted to 3D. On average, we have 263 annotations per minute of our unscripted videos.

Abstract:
3D instance segmentation aims to predict a set of object instances in a scene and represent them as binary foreground masks with corresponding semantic labels. Currently, transformer‑based methods are gaining increasing attention due to their elegant pipelines, reduced manual selection of geometric properties, and superior performance. However, transformer‑based methods fail to simultaneously maintain strong position and content information during query initialization. Additionally, due to supervision at each decoder layer, there exists a phenomenon of object disappearance with the deepening of layers. To overcome these hurdles, we introduce Beyond the Final Layer: Hierarchical Query Fusion Transformer with Agent‑Interpolation Initialization for 3D Instance Segmentation (BFL). Specifically, an Agent‑Interpolation Initialization Module is designed to generate resilient queries capable of achieving a balance between foreground coverage and content learning. Additionally, a Hierarchical Query Fusion Decoder is designed to retain low overlap queries, mitigating the decrease in recall with the deepening of layers. Extensive experiments on ScanNetV2, ScanNet200, ScanNet++ and S3DIS datasets demonstrate the superior performance of BFL.

Abstract:
In this paper, we propose an adaptive margin contrastive learning method for 3D point cloud semantic segmentation, namely AMContrast3D. Most existing methods use equally penalized objectives, which ignore per‑point ambiguities and less discriminated features stemming from transition regions. However, as highly ambiguous points may be indistinguishable even for humans, their manually annotated labels are less reliable, and hard constraints over these points would lead to sub‑optimal models. To address this, we design adaptive objectives for individual points based on their ambiguity levels, aiming to ensure the correctness of low‑ambiguity points while allowing mistakes for high‑ambiguity points. Specifically, we first estimate ambiguities based on position embeddings. Then, we develop a margin generator to shift decision boundaries for contrastive feature embeddings, so margins are narrowed due to increasing ambiguities with even negative margins for extremely high‑ambiguity points. Experimental results on large‑scale datasets, S3DIS and ScanNet, demonstrate that our method outperforms state‑of‑the‑art methods.

Abstract:
Availability of datasets is a strong driver for research on 3D semantic understanding, and whilst obtaining unlabeled 3D point cloud data is straightforward, manually annotating this data with semantic labels is time‑consuming and costly. Recently, Vision Foundation Models (VFMs) enable open‑set semantic segmentation on camera images, potentially aiding automatic labeling. However, VFMs for 3D data have been limited to adaptations of 2D models, which can introduce inconsistencies to 3D labels. This work introduces Label Any Pointcloud (LeAP), leveraging 2D VFMs to automatically label 3D data with any set of classes in any kind of application whilst ensuring label consistency. Using a Bayesian update, point labels are combined into voxels to improve spatio‑temporal consistency. A novel 3D Consistency Network (3D‑CN) exploits 3D information to further improve label quality. Through various experiments, we show that our method can generate high‑quality 3D semantic labels across diverse fields without any manual labeling. Further, models adapted to new domains using our labels show up to a 34.2 mIoU increase in semantic segmentation tasks.
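The abstract does not spell out the form of the Bayesian update, so the following is only a minimal sketch of one common choice: a recursive categorical update in which each point observation that falls into a voxel contributes a per‑class probability vector, and observations are assumed independent. The function names `bayes_update` and `fuse_voxel` and the uniform prior are assumptions made for illustration, not LeAP's actual implementation.

```python
def bayes_update(prior, likelihood):
    """Return the normalized posterior, proportional to prior * likelihood."""
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    total = sum(unnormalized)
    if total == 0.0:
        # Degenerate observation: keep the previous belief unchanged.
        return list(prior)
    return [u / total for u in unnormalized]


def fuse_voxel(observations, num_classes):
    """Fuse per-point class probabilities that fall into one voxel.

    Starts from a uniform prior (an assumption) and applies one
    Bayesian update per observation.
    """
    belief = [1.0 / num_classes] * num_classes
    for obs in observations:
        belief = bayes_update(belief, obs)
    return belief


# Three hypothetical observations of the same voxel over time; the third
# disagrees with the first two, but the fused belief still favours class 0.
observations = [
    [0.6, 0.3, 0.1],
    [0.7, 0.2, 0.1],
    [0.2, 0.7, 0.1],
]
belief = fuse_voxel(observations, num_classes=3)
voxel_label = max(range(len(belief)), key=belief.__getitem__)
print(voxel_label, [round(b, 3) for b in belief])
```

Accumulating evidence this way lets a single inconsistent frame be outvoted by the others, which is one way a voxel‑level label could stay stable over time.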

Abstract:
Semantic segmentation is one of the core tasks in the field of computer vision, and its goal is to accurately classify each pixel in an image. The traditional Unet model achieves efficient feature extraction and fusion through an encoder‑decoder structure, but it still has certain limitations when dealing with complex backgrounds, long‑distance dependencies, and multi‑scale targets. To this end, this paper proposes an improved Unet model combined with an attention mechanism, introduces channel attention and spatial attention modules, enhances the model's ability to focus on important features, and optimizes skip connections through a multi‑scale feature fusion strategy, thereby improving the combination of global semantic information and fine‑grained features. The experiment is based on the Cityscapes dataset and compared with classic models such as FCN, SegNet, DeepLabv3+, and PSPNet. The improved model performs well in terms of mIoU and pixel accuracy (PA), reaching 76.5% and 95.3% respectively. The experimental results verify the superiority of this method in dealing with complex scenes and blurred target boundaries. In addition, this paper discusses the potential of the improved model in practical applications and future expansion directions, indicating that it has broad application value in fields such as autonomous driving, remote sensing image analysis, and medical image processing.
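For readers unfamiliar with the two reported metrics, the sketch below shows how pixel accuracy (PA) and mean intersection‑over‑union (mIoU) are commonly computed from a confusion matrix. It is a generic illustration on a toy label sequence, not the paper's evaluation code; the function names are illustrative, and skipping classes that appear in neither prediction nor ground truth is one convention among several.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """cm[t][p] counts pixels with ground-truth class t predicted as class p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def pixel_accuracy(cm):
    """Fraction of pixels whose predicted class equals the ground truth."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def mean_iou(cm):
    """Average of per-class IoU = TP / (TP + FP + FN)."""
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(n)) - tp
        fn = sum(cm[c]) - tp
        denom = tp + fp + fn
        if denom > 0:  # skip classes absent from both prediction and truth
            ious.append(tp / denom)
    return sum(ious) / len(ious)


# Toy example: six flattened pixels, three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
print(f"PA = {pixel_accuracy(cm):.4f}, mIoU = {mean_iou(cm):.4f}")
```

PA is dominated by frequent classes, whereas mIoU weights every class equally, which is why the two numbers can differ substantially on datasets with large background regions.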

Abstract:
Activation functions are fundamental elements of deep learning architectures as they significantly influence training dynamics. ReLU, while widely used, is prone to the dying neuron problem, which has been mitigated by variants such as LeakyReLU, PReLU, and ELU that better handle negative neuron outputs. Recently, self‑gated activations like GELU and Swish have emerged as state‑of‑the‑art alternatives, leveraging their smoothness to ensure stable gradient flow and prevent neuron inactivity. In this work, we introduce the Gompertz Linear Unit (GoLU), a novel self‑gated activation function defined as $\mathrm{GoLU}(x) = x \, \mathrm{Gompertz}(x)$, where $\mathrm{Gompertz}(x) = e^{-e^{-x}}$. The GoLU activation leverages the right‑skewed asymmetry in the Gompertz function to reduce variance in the latent space more effectively compared to GELU and Swish, while preserving robust gradient flow. Extensive experiments across diverse tasks, including Image Classification, Language Modeling, Semantic Segmentation, Object Detection, Instance Segmentation, and Diffusion, highlight GoLU's superior performance relative to state‑of‑the‑art activation functions, establishing GoLU as a robust alternative to existing activation functions.
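The definition above translates directly into a few lines of code. The following is a minimal scalar sketch in plain Python rather than the authors' implementation; the function names `gompertz` and `golu` and the overflow clamp are illustrative choices.

```python
import math


def gompertz(x: float) -> float:
    """Gompertz gate e^{-e^{-x}}, which takes values between 0 and 1."""
    # Clamp the inner exponent so math.exp does not overflow for very negative x.
    return math.exp(-math.exp(min(-x, 700.0)))


def golu(x: float) -> float:
    """GoLU(x) = x * Gompertz(x)."""
    return x * gompertz(x)


for value in (-2.0, 0.0, 2.0):
    print(f"GoLU({value}) = {golu(value):.4f}")
```

Like other self‑gated activations, the function passes large positive inputs almost unchanged (the gate approaches 1) and suppresses large negative inputs (the gate approaches 0).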

Abstract:
Vision‑language models like CLIP excel at recognizing the single, prominent object in a scene. However, they struggle in complex scenes containing multiple objects. We identify a fundamental reason for this limitation: VLM feature space exhibits excessive mutual feature information (MFI), where the features of one class contain substantial information about other, unrelated classes. This high MFI becomes evident during class‑specific queries, as unrelated objects are activated alongside the queried class. To address this limitation, we propose DCLIP, an efficient framework that learns an optimal level of mutual information while adding only minimal learnable parameters to a frozen VLM. DCLIP uses two complementary losses: a novel MFI Loss that regulates class feature similarity to prevent excessive overlap while preserving necessary shared information, and the Asymmetric Loss (ASL) that aligns image features with the disentangled text features. Through this disentanglement, DCLIP reduces excessive inter‑class similarity by 30%. On multi‑label recognition, DCLIP performs favorably over SOTA approaches on VOC2007 and COCO‑14 while using 75% fewer training parameters. For zero‑shot semantic segmentation, it shows improved performance across six benchmark datasets. These results highlight the importance of feature disentanglement for multi‑object perception in VLMs.

Abstract:
The integration of Artificial Intelligence (AI) and Machine Learning (ML) in next‑generation wireless communication systems has become a cornerstone for advancing intelligent, adaptive, and scalable networks. This reading report examines key innovations in dynamic spectrum sensing (DSS), beginning with the foundational DeepSense framework, which uses convolutional neural networks (CNNs) and spectrogram‑based analysis for real‑time wideband spectrum monitoring. Building on this groundwork, it highlights advancements such as DeepSweep and Wideband Signal Stitching, which address the challenges of scalability, latency, and dataset diversity through parallel processing, semantic segmentation, and robust data augmentation strategies. The report then explores Open Radio Access Networks (ORAN), focusing on AI/ML‑driven enhancements for UAV experimentation, digital twin‑based optimization, network slicing, and self‑healing xApp development. By bridging AI‑based DSS methodologies with ORAN's open, vendor‑neutral architecture, these studies underscore the potential of software‑defined, intelligent infrastructures in enabling efficient, resilient, and self‑optimizing networks for 5G/6G ecosystems. Through this synthesis, the report highlights AI's transformative role in shaping the future of wireless communication and autonomous systems.

Abstract:
Current state‑of‑the‑art segmentation models encode entire images before focusing on specific objects. As a result, they waste computational resources, particularly when small objects are to be segmented in high‑resolution scenes. We introduce FLIP (Fovea‑Like Input Patching), a parameter‑efficient vision model that realizes object segmentation through biologically‑inspired top‑down attention. FLIP selectively samples multi‑resolution patches centered on objects of interest from the input. As a result, it allocates high‑resolution processing to object centers while maintaining coarser peripheral context. This off‑grid, scale‑invariant design enables FLIP to outperform Meta's Segment Anything models (SAM) by large margins: with more than 1000× fewer parameters, FLIP‑Tiny (0.51M parameters) reaches a mean IoU of 78.24% while SAM‑H reaches 75.41% IoU (641.1M parameters). FLIP‑Large even achieves 80.33% mean IoU (96.6M parameters), still running about 6× faster than SAM‑H. We evaluate on six benchmarks in total. On five established benchmarks (Hypersim, KITTI‑360, OpenImages, COCO, LVIS), FLIP consistently outperforms SAM and several of its variants. On our novel ObjaScale dataset, which stress‑tests scale invariance with objects ranging from 0.0001% up to 25% of the image area, we show that FLIP segments even very small objects accurately, where existing models fail severely. FLIP opens new possibilities for real‑time, object‑centric vision applications and offers much higher energy efficiency. We believe that FLIP can act as a powerful foundation model, as it is very well‑suited to track objects over time, for example, when being integrated into slot‑based scene segmentation architectures.

Abstract:
The civil engineering industry faces a critical need for innovative non‑destructive evaluation methods, particularly for ageing critical infrastructure, such as bridges, where current techniques fall short. Muography, a non‑invasive imaging technique, constructs three‑dimensional density maps by detecting interactions of naturally occurring cosmic‑ray muons within the scanned volume. Cosmic‑ray muons provide deep penetration and inherent safety due to their high momenta and natural source. However, the technology's reliance on this source results in constrained muon flux, leading to prolonged acquisition times, noisy reconstructions and image interpretation challenges. To address these limitations, we developed a two‑model deep learning approach. First, we employed a conditional Wasserstein generative adversarial network with gradient penalty (cWGAN‑GP) to perform predictive upsampling of undersampled muography images. Using the Structural Similarity Index Measure (SSIM), 1‑day sampled images matched the perceptual qualities of a 21‑day image, while the Peak Signal‑to‑Noise Ratio (PSNR) indicated noise improvement equivalent to 31 days of sampling. A second cWGAN‑GP model, trained for semantic segmentation, quantitatively assessed the upsampling model's impact on concrete sample features. This model achieved segmentation of rebar grids and tendon ducts, with Dice‑Sørensen accuracy coefficients of 0.8174 and 0.8663. Notably, it could mitigate or remove z‑plane smearing artifacts caused by muography's inverse imaging problem. Both models were trained on a comprehensive Geant4 Monte‑Carlo simulation dataset reflecting realistic civil infrastructure scenarios. Our results demonstrate significant improvements in acquisition speed and image quality, marking a substantial step toward making muography more practical for reinforced concrete infrastructure monitoring applications.
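As background for the PSNR comparison above, the sketch below computes the standard definition, PSNR = 10 log10(MAX^2 / MSE), on two flattened images. It is a generic illustration with made‑up pixel values, not the authors' evaluation pipeline; the function name and the assumption that intensities lie in [0, 1] are illustrative.

```python
import math


def psnr(reference, estimate, max_value=1.0):
    """Peak Signal-to-Noise Ratio in dB between two equally sized images.

    Images are given as flat sequences of intensities; max_value is the
    largest possible intensity (1.0 is assumed here).
    """
    if len(reference) != len(estimate):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical 2x2 images, flattened row by row.
reference = [0.0, 0.5, 1.0, 0.5]
estimate = [0.1, 0.5, 0.9, 0.5]
print(f"PSNR = {psnr(reference, estimate):.2f} dB")
```

Higher values mean the estimate is closer to the reference, so a short acquisition whose upsampled image reaches the PSNR of a much longer acquisition has a comparable noise level by this measure.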

Abstract:
Effective self‑supervised learning (SSL) techniques have been key to unlocking large datasets for representation learning. While many promising methods have been developed using online corpora and captioned photographs, their application to scientific domains, where data encodes highly specialized knowledge, remains a challenge. Liquid Argon Time Projection Chambers (LArTPCs) provide high‑resolution 3D imaging for fundamental physics, but analysis of their sparse, complex point cloud data often relies on supervised methods trained on large simulations, introducing potential biases. We introduce the Point‑based Liquid Argon Masked Autoencoder (PoLAr‑MAE), applying masked point modeling to unlabeled LArTPC images using domain‑specific volumetric tokenization and energy prediction. We show this SSL approach learns physically meaningful trajectory representations directly from data. This yields remarkable data efficiency: fine‑tuning on just 100 labeled events achieves track/shower semantic segmentation performance comparable to the state‑of‑the‑art supervised baseline trained on >100,000 events. Furthermore, internal attention maps exhibit emergent instance segmentation of particle trajectories. While challenges remain, particularly for fine‑grained features, we make concrete SSL's potential for building a foundation model for LArTPC image analysis capable of serving as a common base for all data reconstruction tasks. To facilitate further progress, we release PILArNet‑M, a large dataset of 1M LArTPC events. Project site: https://youngsm.com/polarmae.

Abstract:
Recent advancements in foundation models have transformed computer vision, driving significant performance improvements across diverse domains, including digital histopathology. However, the advantages of domain‑specific histopathology foundation models over general‑purpose models for specialized tasks such as cell analysis remain underexplored. This study investigates the representation learning gap between these two categories by analyzing multi‑level patch embeddings applied to cell instance segmentation and classification. We implement an encoder‑decoder architecture with a consistent decoder and various encoders. These include convolutional, vision transformer (ViT), and hybrid encoders pre‑trained on ImageNet‑22K or LVD‑142M, representing general‑purpose foundation models. These are compared against ViT encoders from the recently released UNI, Virchow2, and Prov‑GigaPath foundation models, trained on patches extracted from hundreds of thousands of histopathology whole‑slide images. The decoder integrates patch embeddings from different encoder depths via skip connections to generate semantic and distance maps. These maps are then post‑processed to create instance segmentation masks where each label corresponds to an individual cell and to perform cell‑type classification. All encoders remain frozen during training to assess their pre‑trained feature extraction capabilities. Using the PanNuke and CoNIC histopathology datasets, and the newly introduced Nissl‑stained CytoDArk0 dataset for brain cytoarchitecture studies, we evaluate instance‑level detection, segmentation accuracy, and cell‑type classification. This study provides insights into the comparative strengths and limitations of general‑purpose vs. histopathology foundation models, offering guidance for model selection in cell‑focused histopathology and brain cytoarchitecture analysis workflows.

Abstract:
How to mitigate negative transfer in transfer learning is a long‑standing and challenging issue, especially in the application of medical image segmentation. Existing methods for reducing negative transfer focus on classification or regression tasks, ignoring the non‑uniform negative transfer risk in different image regions. In this work, we propose a simple yet effective weighted fine‑tuning method that directs the model's attention towards regions with significant transfer risk for medical semantic segmentation. Specifically, we compute a transferability‑guided transfer risk map to quantify the transfer hardness for each pixel and the potential risks of negative transfer. During the fine‑tuning phase, we introduce a map‑weighted loss function, normalized with image foreground size to counter class imbalance. Extensive experiments on brain segmentation datasets show our method significantly improves the target task performance, with gains of 4.37% on FeTS2021 and 1.81% on iSeg2019, avoiding negative transfer across modalities and tasks. Meanwhile, a 2.9% gain under a few‑shot scenario validates the robustness of our approach.

Abstract:
Segmentation of brain tumors is a critical step in treatment planning, yet manual segmentation is both time‑consuming and subjective, relying heavily on the expertise of radiologists. In Sub‑Saharan Africa, this challenge is magnified by overburdened medical systems and limited access to advanced imaging modalities and expert radiologists. Automating brain tumor segmentation using deep learning offers a promising solution. Convolutional Neural Networks (CNNs), especially the U‑Net architecture, have shown significant potential. However, a major challenge remains: achieving generalizability across different datasets. This study addresses this gap by developing a deep learning ensemble that integrates UNet3D, V‑Net, and MSA‑VNet models for the semantic segmentation of gliomas. By initially training on the BraTS‑GLI dataset and fine‑tuning with the BraTS‑SSA dataset, we enhance model performance. Our ensemble approach significantly outperforms individual models, achieving DICE scores of 0.8358 for Tumor Core, 0.8521 for Whole Tumor, and 0.8167 for Enhancing Tumor. These results underscore the potential of ensemble methods in improving the accuracy and reliability of automated brain tumor segmentation, particularly in resource‑limited settings.
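The DICE scores quoted above measure the overlap between a predicted mask and the reference mask for each tumor sub‑region. The sketch below shows the standard definition, 2|A ∩ B| / (|A| + |B|), on toy binary masks; it is not the study's evaluation code, and returning 1.0 when both masks are empty is one convention among several.

```python
def dice_coefficient(pred, target):
    """Dice score between two binary masks given as flat sequences of 0/1."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    size_sum = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if size_sum == 0:
        return 1.0  # both masks empty: treated as perfect agreement
    return 2.0 * intersection / size_sum


# Hypothetical flattened masks for one sub-region.
pred = [1, 1, 0, 0, 1]
target = [1, 0, 0, 1, 1]
print(f"Dice = {dice_coefficient(pred, target):.4f}")
```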

Abstract:
While current Vision Transformer (ViT) adapter methods have shown promising accuracy, their inference speed is implicitly hindered by inefficient memory access operations, e.g., standard normalization and frequent reshaping. In this work, we propose META, a simple and fast ViT adapter that can improve the model's memory efficiency and reduce memory access time by cutting down on inefficient memory access operations. Our method features a memory‑efficient adapter block that shares layer normalization between the self‑attention and feed‑forward network layers, thereby reducing the model's reliance on normalization operations. Within the proposed block, cross‑shaped self‑attention is employed to reduce the model's frequent reshaping operations. Moreover, we augment the adapter block with a lightweight convolutional branch that can enhance local inductive biases, which is particularly beneficial for dense prediction tasks, e.g., object detection, instance segmentation, and semantic segmentation. The adapter block is finally formulated in a cascaded manner to compute diverse head features, thereby enriching the variety of feature representations. Empirically, extensive evaluations on multiple representative datasets validate that META substantially enhances prediction quality, while achieving a new state‑of‑the‑art accuracy‑efficiency trade‑off. Theoretically, we demonstrate that META exhibits superior generalization capability and stronger adaptability.

Abstract:
Multi‑modal image fusion synthesizes information from multiple sources into a single image, facilitating downstream tasks such as semantic segmentation. Current approaches primarily focus on acquiring informative fusion images at the visual display stratum through intricate mappings. Although some approaches attempt to jointly optimize image fusion and downstream tasks, these efforts often lack direct guidance or interaction, serving only to assist with a predefined fusion loss. To address this, we propose an "Unfolding Attribution Analysis Fusion network" (UAAFusion), using attribution analysis to tailor fused images more effectively for semantic segmentation, enhancing the interaction between the fusion and segmentation. Specifically, we utilize attribution analysis techniques to explore the contributions of semantic regions in the source images to task discrimination. At the same time, our fusion algorithm incorporates more beneficial features from the source images, thereby allowing the segmentation to guide the fusion process. Our method constructs a model‑driven unfolding network that uses optimization objectives derived from attribution analysis, with an attribution fusion loss calculated from the current state of the segmentation network. We also develop a new pathway function for attribution analysis, specifically tailored to the fusion tasks in our unfolding network. An attribution attention mechanism is integrated at each network stage, allowing the fusion network to prioritize areas and pixels crucial for high‑level recognition tasks. Additionally, to mitigate the information loss in traditional unfolding networks, a memory augmentation module is incorporated into our network to improve the information flow across various network layers. Extensive experiments demonstrate our method's superiority in image fusion and applicability to semantic segmentation.

Abstract:
In industrial settings, weakly supervised (WS) methods are usually preferred over their fully supervised (FS) counterparts as they do not require costly manual annotations. Unfortunately, the segmentation masks obtained in the WS regime are typically poor in terms of accuracy. In this work, we present a WS method capable of producing accurate masks for semantic segmentation in the case of video streams. More specifically, we build saliency maps that exploit the temporal coherence between consecutive frames in a video, promoting consistency when objects appear in different frames. We apply our method in a waste‑sorting scenario, where we perform weakly supervised video segmentation (WSVS) by training an auxiliary classifier that distinguishes between videos recorded before and after a human operator manually removes specific waste items from a conveyor belt. The saliency maps of this classifier identify materials to be removed, and we modify the classifier training to minimize differences between the saliency map of a central frame and those of adjacent frames, after compensating for object displacement. Experiments on a real‑world dataset demonstrate the benefits of integrating temporal coherence directly during the training phase of the classifier. Code and dataset are available upon request.

Abstract:
While traditional self‑supervised learning methods improve performance and robustness across various medical tasks, they rely on single‑vector embeddings that may not capture fine‑grained concepts such as anatomical structures or organs. The ability to identify such concepts and their characteristics without supervision has the potential to improve pre‑training methods, and enable novel applications such as fine‑grained image retrieval and concept‑based outlier detection. In this paper, we introduce ConceptVAE, a novel pre‑training framework that detects and disentangles fine‑grained concepts from their style characteristics in a self‑supervised manner. We present a suite of loss terms and model architecture primitives designed to discretise input data into a preset number of concepts along with their local style. We validate ConceptVAE both qualitatively and quantitatively, demonstrating its ability to detect fine‑grained anatomical structures such as blood pools and septum walls from 2D cardiac echocardiographies. Quantitatively, ConceptVAE outperforms traditional self‑supervised methods in tasks such as region‑based instance retrieval, semantic segmentation, out‑of‑distribution detection, and object detection. Additionally, we explore the generation of in‑distribution synthetic data that maintains the same concepts as the training data but with distinct styles, highlighting its potential for more calibrated data generation. Overall, our study introduces and validates a promising new pre‑training technique based on concept‑style disentanglement, opening multiple avenues for developing models for medical image analysis that are more interpretable and explainable than black‑box approaches.

Abstract:
Multi‑modal 3D semantic segmentation is vital for applications such as autonomous driving and virtual reality (VR). To effectively deploy these models in real‑world scenarios, it is essential to employ cross‑domain adaptation techniques that bridge the gap between training data and real‑world data. Recently, self‑training with pseudo‑labels has emerged as a predominant method for cross‑domain adaptation in multi‑modal 3D semantic segmentation. However, generating reliable pseudo‑labels necessitates stringent constraints, which often result in sparse pseudo‑labels after pruning. This sparsity can potentially hinder performance improvement during the adaptation process. We propose an image‑guided pseudo‑label enhancement approach that leverages the complementary 2D prior knowledge from the Segment Anything Model (SAM) to introduce more reliable pseudo‑labels, thereby boosting domain adaptation performance. Specifically, given a 3D point cloud and the SAM masks from its paired image data, we collect all 3D points covered by each SAM mask that potentially belong to the same object. Then our method refines the pseudo‑labels within each SAM mask in two steps. First, we determine the class label for each mask using majority voting and employ various constraints to filter out unreliable mask labels. Next, we introduce Geometry‑Aware Progressive Propagation (GAPP) which propagates the mask label to all 3D points within the SAM mask while avoiding outliers caused by 2D‑3D misalignment. Experiments conducted across multiple datasets and domain adaptation scenarios demonstrate that our proposed method significantly increases the quantity of high‑quality pseudo‑labels and enhances the adaptation performance over baseline methods.
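The first refinement step, majority voting inside each SAM mask followed by a reliability filter, can be sketched as below. This is a simplified illustration under stated assumptions: the thresholds `min_ratio` and `min_votes` and both function names are hypothetical stand‑ins for the unspecified "various constraints", and the naive fill in `propagate_naive` does not reproduce the geometry‑aware outlier handling of GAPP.

```python
from collections import Counter


def vote_mask_label(pseudo_labels, mask_point_ids, min_ratio=0.6, min_votes=3):
    """Pick the majority pseudo-label among the labeled points in one mask.

    Returns None when the mask has too few labeled points or the winning
    class is not dominant enough (both thresholds are assumptions).
    """
    votes = Counter(
        pseudo_labels[i] for i in mask_point_ids if pseudo_labels[i] is not None
    )
    if not votes:
        return None
    label, count = votes.most_common(1)[0]
    total = sum(votes.values())
    if total < min_votes or count / total < min_ratio:
        return None
    return label


def propagate_naive(pseudo_labels, mask_point_ids, label):
    """Assign the mask label to unlabeled points in the mask (no geometry check)."""
    refined = list(pseudo_labels)
    if label is not None:
        for i in mask_point_ids:
            if refined[i] is None:
                refined[i] = label
    return refined


# Sparse pseudo-labels for seven points; None marks pruned (unlabeled) points.
pseudo_labels = [2, 2, None, 2, 5, None, 7]
mask_point_ids = [0, 1, 2, 3, 4, 5]  # points covered by one hypothetical SAM mask
mask_label = vote_mask_label(pseudo_labels, mask_point_ids)
refined = propagate_naive(pseudo_labels, mask_point_ids, mask_label)
print(mask_label, refined)
```

In this toy case the two unlabeled points inherit the majority class, which is how mask‑level voting can densify sparse pseudo‑labels; a faithful implementation would additionally reject points that a 2D‑3D misalignment placed inside the mask by mistake.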

Abstract:
We introduce Lifting By Gaussians (LBG), a novel approach for open‑world instance segmentation of 3D Gaussian Splatted Radiance Fields (3DGS). Recently, 3DGS Fields have emerged as a highly efficient and explicit alternative to Neural Field‑based methods for high‑quality Novel View Synthesis. Our 3D instance segmentation method directly lifts 2D segmentation masks from SAM (alternately FastSAM, etc.), together with features from CLIP and DINOv2, directly fusing them onto 3DGS (or similar Gaussian radiance fields such as 2DGS). Unlike previous approaches, LBG requires no per‑scene training, allowing it to operate seamlessly on any existing 3DGS reconstruction. Our approach is not only an order of magnitude faster and simpler than existing approaches; it is also highly modular, enabling 3D semantic segmentation of existing 3DGS fields without requiring a specific parametrization of the 3D Gaussians. Furthermore, our technique achieves superior semantic segmentation for 2D semantic novel view synthesis and 3D asset extraction results while maintaining flexibility and efficiency. We further introduce a novel approach to evaluate individually segmented 3D assets from 3D radiance field segmentation methods.

Abstract:
The Great Outdoors (GO) dataset is a multi‑modal annotated data resource aimed at advancing ground robotics research in unstructured environments. This dataset provides the most comprehensive set of data modalities and annotations compared to existing off‑road datasets. In total, the GO dataset includes six unique sensor types with high‑quality semantic annotations and GPS traces to support tasks such as semantic segmentation, object detection, and SLAM. The diverse environmental conditions represented in the dataset present significant real‑world challenges that provide opportunities to develop more robust solutions to support the continued advancement of field robotics, autonomous exploration, and perception systems in natural environments. The dataset can be downloaded at: https://www.unmannedlab.org/the‑great‑outdoors‑dataset/

Abstract:
Recent advances in deep learning have shown that learning robust feature representations is critical for the success of many computer vision tasks, including medical image segmentation. In particular, both transformer and convolutional‑based architectures have benefited from leveraging pretext tasks for pretraining. However, the adoption of pretext tasks in 3D medical imaging has been less explored and remains a challenge, especially in the context of learning generalizable feature representations. We propose a novel pretraining strategy using diffusion models with anatomical guidance, tailored to the intricacies of 3D medical image data. We introduce an auxiliary diffusion process to pretrain a model that produces generalizable feature representations, useful for a variety of downstream segmentation tasks. We employ an additional model that predicts 3D universal body‑part coordinates, providing guidance during the diffusion process and improving spatial awareness in generated representations. This approach not only aids in resolving localization inaccuracies but also enriches the model's ability to understand complex anatomical structures. Empirical validation on a 13‑class organ segmentation task demonstrates the effectiveness of our pretraining technique. It surpasses existing restorative pretraining methods in 3D medical image segmentation by 7.5%, and is competitive with the state‑of‑the‑art contrastive pretraining approach, achieving an average Dice coefficient of 67.8 in a non‑linear evaluation scenario.

Abstract:
Semantic segmentation assigns labels to pixels in images, a critical yet challenging task in computer vision. Convolutional methods, although capturing local dependencies well, struggle with long‑range relationships. Vision Transformers (ViTs) excel in global context capture but are hindered by high computational demands, especially for high‑resolution inputs. Most research optimizes the encoder architecture, leaving the bottleneck underexplored ‑ a key area for enhancing performance and efficiency. We propose ContextFormer, a hybrid framework leveraging the strengths of CNNs and ViTs in the bottleneck to balance efficiency, accuracy, and robustness for real‑time semantic segmentation. The framework's efficiency is driven by three synergistic modules: the Token Pyramid Extraction Module (TPEM) for hierarchical multi‑scale representation, the Transformer and Branched DepthwiseConv (Trans‑BDC) block for dynamic scale‑aware feature modeling, and the Feature Merging Module (FMM) for robust integration with enhanced spatial and contextual consistency. Extensive experiments on ADE20K, Pascal Context, CityScapes, and COCO‑Stuff datasets show ContextFormer significantly outperforms existing models, achieving state‑of‑the‑art mIoU scores, setting a new benchmark for efficiency and performance. The codes will be made publicly available upon acceptance.

Abstract:
In this paper, we propose a novel active learning approach integrated with an improved semi‑supervised learning framework to reduce the cost of manual annotation and enhance model performance. Our proposed approach effectively leverages both the labelled data selected through active learning and the unlabelled data excluded from the selection process. The proposed active learning approach pinpoints areas where the pseudo‑labels are likely to be inaccurate. Then, an automatic and efficient pseudo‑label auto‑refinement (PLAR) module is proposed to correct pixels with potentially erroneous pseudo‑labels by comparing their feature representations with those of labelled regions. This approach operates without increasing the labelling budget and is based on the cluster assumption, which states that pixels belonging to the same class should exhibit similar representations in feature space. Furthermore, manual labelling is only applied to the most difficult and uncertain areas in unlabelled data, where insufficient information prevents the PLAR module from making a decision. We evaluated the proposed hybrid semi‑supervised active learning framework on two benchmark datasets, one from natural and the other from remote sensing imagery domains. In both cases, it outperformed state‑of‑the‑art methods in the semantic segmentation task.

Abstract:
Most existing RGB‑D semantic segmentation methods focus on feature‑level fusion, including complex cross‑modality and cross‑scale fusion modules. However, these methods may cause misalignment problems in the feature fusion process and counter‑intuitive patches in the segmentation results. Inspired by the popular pixel‑node‑pixel pipeline, we propose to 1) fuse features from two modalities in a late fusion style, during which the geometric feature injection is guided by a texture feature prior; 2) employ Graph Neural Networks (GNNs) on the fused features to alleviate the emergence of irregular patches by inferring patch relationships. At the 3D feature extraction stage, we argue that traditional CNNs are not efficient enough for depth maps. We therefore encode the depth map into a normal map, from which CNNs can easily extract object surface tendencies. At the projection matrix generation stage, we find Biased‑Assignment and Ambiguous‑Locality issues in the original pipeline. Therefore, we propose to 1) adopt the Kullback‑Leibler Loss to ensure that no important pixel features are missed, which can be viewed as a hard pixel mining process; 2) connect regions that are close to each other in both Euclidean space and semantic space with larger edge weights so that location information can be considered. Extensive experiments on two public datasets, NYU‑DepthV2 and SUN RGB‑D, have shown that our approach can consistently boost the performance of the RGB‑D semantic segmentation task.

Abstract:
Existing concealed object segmentation (COS) methods frequently utilize reversible strategies to address uncertain regions. However, these approaches are typically restricted to the mask domain, leaving the potential of the RGB domain underexplored. To address this, we propose the Reversible Unfolding Network (RUN), which applies reversible strategies across both mask and RGB domains through a theoretically grounded framework, enabling accurate segmentation. RUN first formulates a novel COS model by incorporating an extra residual sparsity constraint to minimize segmentation uncertainties. The iterative optimization steps of the proposed model are then unfolded into a multistage network, with each step corresponding to a stage. Each stage of RUN consists of two reversible modules: the Segmentation‑Oriented Foreground Separation (SOFS) module and the Reconstruction‑Oriented Background Extraction (ROBE) module. SOFS applies the reversible strategy at the mask level and introduces Reversible State Space to capture non‑local information. ROBE extends this to the RGB domain, employing a reconstruction network to address conflicting foreground and background regions identified as distortion‑prone areas, which arise from their separate estimation by independent modules. As the stages progress, RUN gradually facilitates reversible modeling of foreground and background in both the mask and RGB domains, directing the network's attention to uncertain regions and mitigating false‑positive and false‑negative results. Extensive experiments demonstrate the superior performance of RUN and highlight the potential of unfolding‑based frameworks for COS and other high‑level vision tasks. We will release the code and models.

Abstract:
Task‑generic promptable image segmentation aims to achieve segmentation of diverse samples under a single task description by utilizing only one task‑generic prompt. Current methods leverage the generalization capabilities of Vision‑Language Models (VLMs) to infer instance‑specific prompts from these task‑generic prompts in order to guide the segmentation process. However, when VLMs struggle to generalise to some image instances, the predicted instance‑specific prompts become poor. To solve this problem, we introduce Instance‑specific Negative Mining for Task‑Generic Promptable Segmentation (INT). The key idea of INT is to adaptively reduce the influence of irrelevant (negative) prior knowledge whilst increasing the use of the most plausible prior knowledge, selected by negative mining with higher contrast, in order to optimise instance‑specific prompt generation. Specifically, INT consists of two components: (1) instance‑specific prompt generation, which progressively filters out incorrect information in prompt generation; (2) semantic mask generation, which ensures each image instance segmentation correctly matches the semantics of the instance‑specific prompts. INT is validated on six datasets, including camouflaged objects and medical images, demonstrating its effectiveness, robustness and scalability.

Abstract:
The communication scenarios and channel characteristics of 6G will be more complex and difficult to characterize. Conventional methods for channel prediction face challenges in achieving an optimal balance between accuracy, practicality, and generalizability. Additionally, they often fail to effectively leverage environmental features. With the integration of communication and artificial intelligence as a pivotal development vision for 6G, it is imperative to achieve intelligent prediction of channel characteristics. Vision‑aided methods have been employed in various wireless communication tasks, excluding channel prediction, and have demonstrated enhanced efficiency and performance. In this paper, we propose a vision‑aided two‑stage model for channel prediction in millimeter wave vehicular communication scenarios, realizing accurate received power prediction utilizing solely RGB images. Firstly, we obtain original images of the propagation environment through an RGB camera. Secondly, three typical computer vision methods, namely object detection, instance segmentation and binary masking, are employed for environmental information extraction from the original images in stage 1, and prediction of received power based on the processed images is implemented in stage 2. Pre‑trained YOLOv8 and ResNets are used in stages 1 and 2, respectively, and fine‑tuned on datasets. Finally, we conduct five experiments to evaluate the performance of the proposed model, demonstrating its feasibility, accuracy and generalization capabilities. The model proposed in this paper offers novel solutions for achieving intelligent channel prediction in vehicular communications.

Abstract:
Vision foundation models have demonstrated exceptional generalization capabilities in segmentation tasks for both generic and specialized images. However, a performance gap persists between foundation models and task‑specific, specialized models. Fine‑tuning foundation models on downstream datasets is often necessary to bridge this gap. Unfortunately, obtaining fully annotated ground truth for downstream datasets is both challenging and costly. To address this limitation, we propose a novel test‑time training paradigm that enhances the performance of foundation models on downstream datasets without requiring full annotations. Specifically, our method employs simple point prompts to guide a test‑time semi‑self‑supervised training task. The model learns by resolving the ambiguity of the point prompt through various augmentations. This approach directly tackles challenges in the medical imaging field, where acquiring annotations is both time‑intensive and expensive. We conducted extensive experiments on our new Videofluoroscopy dataset (VFSS‑5k) for the instance segmentation task, achieving an average Dice coefficient of 0.868 across 12 anatomies with a single model.

Abstract:
This paper presents an analysis of utilizing elevation data to aid outdoor point cloud semantic segmentation through existing machine‑learning networks in remote sensing, specifically in urban, built‑up areas. In dense outdoor point clouds, the receptive field of a machine learning model may be too small to accurately determine the surroundings and context of a point. By computing Digital Terrain Models (DTMs) from the point clouds, we extract the relative elevation feature, which is the vertical distance from the terrain to a point. RandLA‑Net is employed for efficient semantic segmentation of large‑scale point clouds. We assess its performance across three diverse outdoor datasets captured with varying sensor technologies and sensor locations. Integration of relative elevation data leads to consistent performance improvements across all three datasets, most notably in the Hessigheim dataset, with an increase of 3.7 percentage points in average F1 score from 72.35% to 76.01%, by establishing long‑range dependencies between ground and objects. We also explore additional local features such as planarity, normal vectors, and 2D features, but their efficacy varied based on the characteristics of the point cloud. Ultimately, this study underscores the important role of the non‑local relative elevation feature for semantic segmentation of point clouds in remote sensing applications.
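
The relative elevation feature (vertical distance from the terrain to a point) can be illustrated with a short sketch. The abstract does not say how the DTM is computed, so the version below assumes a deliberately crude stand-in: the lowest point in each horizontal grid cell is taken as the ground height. The function names and the `cell` size are hypothetical.

```python
import math

def grid_min_dtm(points, cell=1.0):
    """Crude DTM stand-in: lowest z per (x, y) grid cell.

    A real DTM would come from ground filtering and interpolation;
    this is only an assumption for illustration.
    """
    dtm = {}
    for x, y, z in points:
        key = (math.floor(x / cell), math.floor(y / cell))
        if key not in dtm or z < dtm[key]:
            dtm[key] = z
    return dtm

def relative_elevation(points, dtm, cell=1.0):
    """Height of each point above the terrain height of its grid cell."""
    return [
        z - dtm[(math.floor(x / cell), math.floor(y / cell))]
        for x, y, z in points
    ]

cloud = [(0.2, 0.3, 10.0), (0.8, 0.1, 12.5), (1.5, 0.5, 11.0), (1.2, 0.9, 11.0)]
dtm = grid_min_dtm(cloud)
rel = relative_elevation(cloud, dtm)
```

The resulting per-point value would be appended to the input features of the segmentation network, so that a point carries its height above ground regardless of how small the receptive field is.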

Abstract:
This paper presents Contourformer, a real‑time contour‑based instance segmentation algorithm. The method is fully based on the DETR paradigm and achieves end‑to‑end inference through iterative and progressive mechanisms to optimize contours. To improve efficiency and accuracy, we develop two novel techniques: sub‑contour decoupling mechanisms and contour fine‑grained distribution refinement. In the sub‑contour decoupling mechanism, we propose a deformable attention‑based module that adaptively selects sampling regions based on the current predicted contour, enabling more effective capturing of object boundary information. Additionally, we design a multi‑stage optimization process to enhance segmentation precision by progressively refining sub‑contours. The contour fine‑grained distribution refinement technique aims to further improve the ability to express fine details of contours. These innovations enable Contourformer to achieve stable and precise segmentation for each instance while maintaining real‑time performance. Extensive experiments demonstrate the superior performance of Contourformer on multiple benchmark datasets, including SBD, COCO, and KINS. We conduct comprehensive evaluations and comparisons with existing state‑of‑the‑art methods, showing significant improvements in both accuracy and inference speed. This work provides a new solution for contour‑based instance segmentation tasks and lays a foundation for future research, with the potential to become a strong baseline method in this field.

Abstract:
Open‑vocabulary semantic segmentation (OVSS) is an open‑world task that aims to assign each pixel within an image to a specific class defined by arbitrary text descriptions. While large‑scale vision‑language models have shown remarkable open‑vocabulary capabilities, their image‑level pretraining limits effectiveness on pixel‑wise dense prediction tasks like OVSS. Recent cost‑based methods narrow this granularity gap by constructing pixel‑text cost maps and refining them via cost aggregation mechanisms. Despite achieving promising performance, these approaches suffer from high computational costs and long inference latency. In this paper, we identify two major sources of redundancy in the cost‑based OVSS framework: redundant information introduced during cost maps construction and inefficient sequence modeling in cost aggregation. To address these issues, we propose ERR‑Seg, an efficient architecture that incorporates Redundancy‑Reduced Hierarchical Cost maps (RRHC) and Redundancy‑Reduced Cost Aggregation (RRCA). Specifically, RRHC reduces redundant class channels by customizing a compact class vocabulary for each image and integrates hierarchical cost maps to enrich semantic representation. RRCA alleviates computational burden by performing both spatial‑level and class‑level sequence reduction before aggregation. Overall, ERR‑Seg results in a lightweight structure for OVSS, characterized by substantial memory and computational savings without compromising accuracy. Compared to previous state‑of‑the‑art methods on the ADE20K‑847 benchmark, ERR‑Seg improves performance by 5.6% while achieving a 3.1× speedup.

Abstract:
Semantic segmentation of indoor point clouds has found various applications in the creation of digital twins for robotics, navigation and building information modeling (BIM). However, most existing datasets of labeled indoor point clouds have been acquired by photogrammetry. In contrast, Terrestrial Laser Scanning (TLS) can acquire dense sub‑centimeter point clouds and has become the standard for surveyors. We present 3DSES (3D Segmentation of ESGT point clouds), a new dataset of indoor dense TLS colorized point clouds covering 427 m² of an engineering school. 3DSES has a unique double annotation format: semantic labels annotated at the point level alongside a full 3D CAD model of the building. We introduce a model‑to‑cloud algorithm for automated labeling of indoor point clouds using an existing 3D CAD model. 3DSES has 3 variants of various semantic and geometrical complexities. We show that our model‑to‑cloud alignment can produce pseudo‑labels on our point clouds with a > 95% accuracy, allowing us to train deep models with significant time savings compared to manual labeling. First baselines on 3DSES show the difficulties encountered by existing models when segmenting objects relevant to BIM, such as light and safety utilities. We show that segmentation accuracy can be improved by leveraging pseudo‑labels and Lidar intensity, an information rarely considered in current datasets. Code and data will be open sourced.

Abstract:
Open‑vocabulary semantic segmentation attempts to classify and outline objects in an image using arbitrary text labels, including those unseen during training. Self‑supervised learning resolves numerous visual and linguistic processing problems when effectively trained. This study investigates simple yet efficient methods for adapting previously learned foundation models for open‑vocabulary semantic segmentation tasks. Our research proposes "Beyond‑Labels", a lightweight transformer‑based fusion module that uses a small amount of image segmentation data to fuse frozen visual representations with language concepts. This strategy allows the model to leverage the extensive knowledge of pre‑trained models without requiring significant retraining, making the approach data‑efficient and scalable. Furthermore, we capture positional information in images using Fourier embeddings, improving generalization and enabling smooth and consistent spatial encoding. We perform thorough ablation studies to examine the main components of our proposed method. On the standard benchmark PASCAL‑5i, the method performs better despite being trained on frozen vision and language representations. Index Terms: Beyond‑Labels, open‑vocabulary semantic segmentation, Fourier embeddings, PASCAL‑5i

Abstract:
Automated interpretation of seismic images using deep learning methods is challenging because of the limited availability of training data. Few‑shot learning is a suitable learning paradigm in such scenarios due to its ability to adapt to a new task with limited supervision (small training budget). Existing few‑shot semantic segmentation (FSSS) methods fix the number of target classes. Therefore, they do not support joint training on multiple datasets varying in the number of classes. In the context of the interpretation of seismic facies, fixing the number of target classes inhibits the generalization capability of a model trained on one facies dataset to another, which is likely to have a different number of facies. To address this shortcoming, we propose a few‑shot semantic segmentation method for interpreting seismic facies that can adapt to the varying number of facies across the dataset, dubbed the AdaSemSeg. In general, the backbone network of FSSS methods is initialized with the statistics learned from the ImageNet dataset for better performance. The lack of such a huge annotated dataset for seismic images motivates using a self‑supervised algorithm on seismic datasets to initialize the backbone network. We have trained the AdaSemSeg on three public seismic facies datasets with different numbers of facies and evaluated the proposed method on multiple metrics. The performance of the AdaSemSeg on unseen datasets (not used in training) is better than the prototype‑based few‑shot method and baselines.

Abstract:
Semantic segmentation plays a crucial role in enabling machines to understand and interpret visual scenes at a pixel level. While traditional segmentation methods have achieved remarkable success, their generalization to diverse scenes and unseen object categories remains limited. Recent advancements in large language models (LLMs) offer a promising avenue for bridging visual and textual modalities, providing a deeper understanding of semantic relationships. In this paper, we propose LangSeg, a novel LLM‑guided semantic segmentation method that leverages context‑sensitive, fine‑grained subclass descriptors generated by LLMs. Our framework integrates these descriptors with a pre‑trained Vision Transformer (ViT) to achieve superior segmentation performance without extensive model retraining. We evaluate LangSeg on two challenging datasets, ADE20K and COCO‑Stuff, where it outperforms state‑of‑the‑art models, achieving up to a 6.1% improvement in mean Intersection over Union (mIoU). Additionally, we conduct a comprehensive ablation study and human evaluation to validate the effectiveness of our method in real‑world scenarios. The results demonstrate that LangSeg not only excels in semantic understanding and contextual alignment but also provides a flexible and efficient framework for language‑guided segmentation tasks. This approach opens up new possibilities for interactive and domain‑specific segmentation applications.

Abstract:
Current unsupervised domain adaptation (UDA) methods for semantic segmentation typically assume identical class labels between the source and target domains. This assumption ignores the label‑level domain gap, which is common in real‑world scenarios, thus limiting their ability to identify finer‑grained or novel categories without requiring extensive manual annotation. A promising direction to address this limitation lies in recent advancements in foundation models, which exhibit strong generalization abilities due to their rich prior knowledge. However, these models often struggle with domain‑specific nuances and underrepresented fine‑grained categories. To address these challenges, we introduce DynAlign, a framework that integrates UDA with foundation models to bridge both the image‑level and label‑level domain gaps. Our approach leverages prior semantic knowledge to align source categories with target categories that can be novel, more fine‑grained, or named differently (e.g., vehicle to car, truck, bus). Foundation models are then employed for precise segmentation and category reassignment. To further enhance accuracy, we propose a knowledge fusion approach that dynamically adapts to varying scene contexts. DynAlign generates accurate predictions in a new target label space without requiring any manual annotations, allowing seamless adaptation to new taxonomies through either model retraining or direct inference. Experiments on the street scene semantic segmentation benchmarks GTA to Mapillary Vistas and GTA to IDD validate the effectiveness of our approach, achieving a significant improvement over existing methods. Our code will be publicly available.

Abstract:
Vision Transformers (ViTs) have recently taken computer vision by storm. However, the softmax attention underlying ViTs comes with a quadratic complexity in time and memory, hindering the application of ViTs to high‑resolution images. We revisit the attention design and propose a linear attention method to address the limitation, which doesn't sacrifice ViT's core advantage of capturing global representation like existing methods (e.g. local window attention of Swin). We further investigate the key difference between linear attention and softmax attention. Our empirical results suggest that linear attention lacks a fundamental property of concentrating the distribution of the attention matrix. Inspired by this observation, we introduce a local concentration module to enhance linear attention. By incorporating enhanced linear global attention and local window attention, we propose a new ViT architecture, dubbed L^2ViT. Notably, L^2ViT can effectively capture both global interactions and local representations while enjoying linear computational complexity. Extensive experiments demonstrate the strong performance of L^2ViT. On image classification, L^2ViT achieves 84.4% Top‑1 accuracy on ImageNet‑1K without any extra training data or label. By further pre‑training on ImageNet‑22k, it attains 87.0% when fine‑tuned with resolution 384^2. For downstream tasks, L^2ViT delivers favorable performance as a backbone on object detection as well as semantic segmentation.
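
The complexity argument above (quadratic softmax attention versus linear attention) can be made concrete with a generic kernelized linear attention sketch. This is not the L^2ViT formulation, which the abstract does not specify; it assumes the common positive feature map phi(x) = elu(x) + 1, and all function names are illustrative. The point is that reordering the product as phi(Q) (phi(K)^T V) avoids building the N x N attention matrix while giving the same result.

```python
import math

def phi(x):
    """Assumed positive feature map: elu(x) + 1."""
    return x + 1.0 if x > 0 else math.exp(x)

def feature_map(M):
    return [[phi(v) for v in row] for row in M]

def linear_attention(Q, K, V):
    """O(N * d * dv): aggregate keys and values once, then reuse for every query."""
    fQ, fK = feature_map(Q), feature_map(K)
    n, d, dv = len(K), len(K[0]), len(V[0])
    # kv[a][b] = sum over tokens of phi(K)[t][a] * V[t][b]
    kv = [[sum(fK[t][a] * V[t][b] for t in range(n)) for b in range(dv)]
          for a in range(d)]
    k_sum = [sum(fK[t][a] for t in range(n)) for a in range(d)]
    out = []
    for q in fQ:
        norm = sum(q[a] * k_sum[a] for a in range(d))
        out.append([sum(q[a] * kv[a][b] for a in range(d)) / norm
                    for b in range(dv)])
    return out

def quadratic_reference(Q, K, V):
    """O(N^2): same kernel, but with explicit pairwise attention weights."""
    fQ, fK = feature_map(Q), feature_map(K)
    out = []
    for q in fQ:
        w = [sum(qa * ka for qa, ka in zip(q, k)) for k in fK]
        norm = sum(w)
        out.append([sum(w[t] * V[t][b] for t in range(len(V))) / norm
                    for b in range(len(V[0]))])
    return out

Q = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]
K = [[0.1, 0.4], [-0.7, 1.2], [0.9, -0.5]]
V = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0], [3.0, 1.0, 0.5]]
fast = linear_attention(Q, K, V)
slow = quadratic_reference(Q, K, V)
```

Both functions return the same values up to floating-point error; only the order of the matrix products differs, which is what yields the linear cost in the number of tokens.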

Abstract:
Learning effective deep portrait matting models requires training data of both high quality and large quantity. Neither quality nor quantity can be easily met for portrait matting, however. Since the most accurate ground‑truth portrait mattes are acquired in front of a green screen, it is almost impossible to harvest a large‑scale portrait matting dataset in reality. This work shows that one can leverage text prompts and the recent Layer Diffusion model to generate high‑quality portrait foregrounds and extract latent portrait mattes. However, the portrait mattes cannot be readily used due to significant generation artifacts. Inspired by the connectivity priors observed in portrait images, that is, the border of portrait foregrounds always appears connected, a connectivity‑aware approach is introduced to refine portrait mattes. Building on this, a large‑scale portrait matting dataset is created, termed LD‑Portrait‑20K, with 20,051 portrait foregrounds and high‑quality alpha mattes. Extensive experiments demonstrated the value of the LD‑Portrait‑20K dataset, with models trained on it significantly outperforming those trained on other datasets. In addition, comparisons with the chroma keying algorithm and an ablation study on dataset capacity further confirmed the effectiveness of the proposed matte creation approach. Further, the dataset also contributes to state‑of‑the‑art video portrait matting, implemented by simple video segmentation and a trimap‑based image matting model trained on this dataset.

Abstract:
This paper introduces a novel approach to 4D Panoptic LiDAR Segmentation that decouples semantic and instance segmentation, leveraging single‑scan semantic predictions as prior information for instance segmentation. Our method D‑PLS first performs single‑scan semantic segmentation and aggregates the results over time, using them to guide instance segmentation. The modular design of D‑PLS allows for seamless integration on top of any semantic segmentation architecture, without requiring architectural changes or retraining. We evaluate our approach on the SemanticKITTI dataset, where it demonstrates significant improvements over the baseline in both classification and association tasks, as measured by the LiDAR Segmentation and Tracking Quality (LSTQ) metric. Furthermore, we show that our decoupled architecture not only enhances instance prediction but also surpasses the baseline due to advancements in single‑scan semantic segmentation.

Abstract:
Existing computer vision processing pipelines acquire visual information using an image sensor that captures pixel information in the Bayer pattern. The raw sensor data are then processed using an image signal processor (ISP) that first converts Bayer pixel data to RGB on a pixel‑by‑pixel basis, followed by video convolutional network (VCN) processing on a frame‑by‑frame basis. Both ISP and VCN are computationally expensive with high power consumption and latency. In this paper, we propose a novel framework that eliminates the ISP and leverages motion estimation to accelerate video vision tasks directly in the Bayer domain. We introduce Motion Estimation‑based Video Convolution (MEVC), which integrates sliding‑window motion estimation into each convolutional layer, enabling prediction and residual‑based refinement that reduces redundant computations across frames. This design bridges the structural gap between block‑based motion estimation and spatial convolution, enabling accurate, low‑cost processing. Our end‑to‑end pipeline supports raw Bayer input and achieves over 70% reduction in FLOPs with minimal accuracy degradation across video semantic segmentation, depth estimation, and object detection benchmarks, using both synthetic Bayer‑converted and real Bayer video datasets. This framework generalizes across convolution‑based models and marks the first effective reuse of motion estimation for accelerating video computer vision directly from raw sensor data.

Abstract:
Domain generalization aims to find ways for deep learning models to maintain their performance despite significant domain shifts between training and inference datasets. This is particularly important for models that need to be robust or are costly to train. LiDAR perception in autonomous driving is impacted by both of these concerns, leading to the emergence of various approaches. This work addresses the challenge by proposing a geometry‑based approach, leveraging the sequential structure of LiDAR sensors, which sets it apart from the learning‑based methods commonly found in the literature. The proposed method, called 3DLabelProp, is applied on the task of LiDAR Semantic Segmentation (LSS). Through extensive experimentation on seven datasets, it is demonstrated to be a state‑of‑the‑art approach, outperforming both naive and other domain generalization methods.

Abstract:
Ultrasonic testing is a common Non‑Destructive Inspection (NDI) method used in aerospace manufacturing. However, the complexity and size of the ultrasonic scans make it challenging to identify defects through visual inspection or machine learning models. Using computer vision techniques to identify defects from ultrasonic scans is an evolving research area. In this study, we used instance segmentation to identify the presence of defects in the ultrasonic scan images of composite panels that are representative of real components manufactured in aerospace. We used two models based on Mask‑RCNN (Detectron 2) and YOLO 11 respectively. Additionally, we implemented a simple statistical pre‑processing technique that reduces the burden of requiring custom‑tailored pre‑processing techniques. Our study demonstrates the feasibility and effectiveness of using instance segmentation in the NDI pipeline by significantly reducing data pre‑processing time, inspection time, and overall costs.

Abstract:
With the advancement of computing resources, an increasing number of Neural Networks (NNs) are appearing for image detection and segmentation. However, these methods usually accept a 2D RGB image as input. On the other hand, Light Detection And Ranging (LiDAR) sensors with many layers provide images that are similar to those obtained from a traditional low‑resolution RGB camera. Following this principle, a new dataset for segmenting cars in pseudo‑RGB images has been generated. This dataset combines the information given by the LiDAR sensor into a Spherical Range Image (SRI), specifically the reflectivity, near‑infrared, and signal intensity 2D images. These images are then fed into instance segmentation NNs. These NNs segment the cars that appear in these images, achieving a Bounding Box (BB) and mask precision of 88% and 81.5%, respectively, with You Only Look Once (YOLO)‑v8 large. Using this segmentation NN, several trackers have been applied to follow each segmented car instance along a video feed, showing strong performance in real‑world experiments.

Abstract:
Accurate prediction of pedestrian trajectories is crucial for enhancing the safety of autonomous vehicles and reducing traffic fatalities involving pedestrians. While numerous studies have focused on modeling interactions among pedestrians to forecast their movements, the influence of environmental factors and scene‑object placements has been comparatively underexplored. In this paper, we present a novel trajectory prediction model that integrates both pedestrian interactions and environmental context to improve prediction accuracy. Our approach captures spatial and temporal interactions among pedestrians within a sparse graph framework. To account for pedestrian‑scene interactions, we employ advanced image enhancement and semantic segmentation techniques to extract detailed scene features. These scene and interaction features are then fused through a cross‑attention mechanism, enabling the model to prioritize relevant environmental factors that influence pedestrian movements. Finally, a temporal convolutional network processes the fused features to predict future pedestrian trajectories. Experimental results demonstrate that our method significantly outperforms existing state‑of‑the‑art approaches, achieving ADE and FDE values of 0.252 and 0.372 meters, respectively, underscoring the importance of incorporating both social interactions and environmental context in pedestrian trajectory prediction.
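
For readers unfamiliar with the two reported metrics, ADE (average displacement error) is the mean Euclidean distance between predicted and ground-truth positions over all predicted time steps, and FDE (final displacement error) is that distance at the last step. A minimal sketch for a single trajectory follows; the function name and the toy coordinates are illustrative, and benchmark protocols usually average these values over many pedestrians.

```python
import math


def ade_fde(predicted, ground_truth):
    """Return (ADE, FDE) for one trajectory given as lists of (x, y) in meters."""
    if not predicted or len(predicted) != len(ground_truth):
        raise ValueError("trajectories must be non-empty and of equal length")
    # Per-time-step Euclidean displacement between prediction and truth.
    errors = [math.dist(p, t) for p, t in zip(predicted, ground_truth)]
    return sum(errors) / len(errors), errors[-1]


ade, fde = ade_fde([(0, 0), (1, 0), (2, 0)], [(0, 0), (1, 1), (2, 2)])
```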

Abstract:
Few‑shot Semantic Segmentation (FSS) is a challenging task that utilizes limited support images to segment associated unseen objects in query images. However, recent FSS methods are observed to perform worse as the number of shots increases. As the support set enlarges, existing FSS networks struggle to concentrate on the high‑contributed supports and can easily be overwhelmed by the low‑contributed supports, which can severely impair the mask predictions. In this work, we study this challenging issue, called support dilution. Our goal is to recognize, select, preserve, and enhance the high‑contributed supports in the raw support pool. Technically, our method contains three novel parts. First, we propose a contribution index to quantitatively estimate whether a high‑contributed support is diluted. Second, we develop the Symmetric Correlation (SC) module to preserve and enhance the high‑contributed support features, minimizing the distraction by the low‑contributed features. Third, we design the Support Image Pruning operation to retrieve a compact, high‑quality subset by discarding low‑contributed supports. We conduct extensive experiments on two FSS benchmarks, COCO‑20i and PASCAL‑5i. The segmentation results demonstrate the compelling performance of our solution over state‑of‑the‑art FSS approaches. In addition, we apply our solution to online segmentation and real‑world segmentation; the convincing segmentation results show the practical ability of our work in real‑world demonstrations.

Abstract:
Dynamic urban environments, characterized by moving cameras and objects, pose significant challenges for camera trajectory estimation by complicating the distinction between camera‑induced and object motion. We introduce MONA, a novel framework designed for robust moving object detection and segmentation from videos shot by dynamic cameras. MONA comprises two key modules: Dynamic Points Extraction, which leverages optical flow and tracking any point to identify dynamic points, and Moving Object Segmentation, which employs adaptive bounding box filtering and Segment Anything for precise moving object segmentation. We validate MONA by integrating it with the camera trajectory estimation method LEAP‑VO, and it achieves state‑of‑the‑art results on the MPI Sintel dataset compared to existing methods. These results demonstrate MONA's effectiveness for moving object detection and its potential in many other applications in the urban planning field.

Abstract:
Monocular depth estimation (MDE) is a challenging task in computer vision, often hindered by the cost and scarcity of high‑quality labeled datasets. We tackle this challenge using auxiliary datasets from related vision tasks for an alternating training scheme with a shared decoder built on top of a pre‑trained vision foundation model, while giving a higher weight to MDE. Through extensive experiments we demonstrate the benefits of incorporating various in‑domain auxiliary datasets and tasks to improve MDE quality on average by ~11%. Our experimental analysis shows that auxiliary tasks have different impacts, confirming the importance of task selection, highlighting that quality gains are not achieved by merely adding data. Remarkably, our study reveals that using semantic segmentation datasets as Multi‑Label Dense Classification (MLDC) often results in additional quality gains. Lastly, our method significantly improves the data efficiency for the considered MDE datasets, enhancing their quality while reducing their size by at least 80%. This paves the way for using auxiliary data from related tasks to improve MDE quality despite limited availability of high‑quality labeled data. Code is available at https://jugit.fz‑juelich.de/ias‑8/mdeaux.

Abstract:
Histopathological image analysis is a reliable method for prostate cancer identification. In this paper, we present a comparative analysis of two approaches for segmenting glandular structures in prostate images to automate Gleason grading. The first approach utilizes a hand‑crafted learning technique, combining Gray Level Co‑Occurrence Matrix (GLCM) and Local Binary Pattern (LBP) texture descriptors to highlight spatial dependencies and minimize information loss at the pixel level. For machine driven feature extraction, we employ a U‑Net convolutional neural network to perform semantic segmentation of prostate gland stroma tissue. Support vector machine‑based learning of hand‑crafted features achieves impressive classification accuracies of 99.0% and 95.1% for GLCM and LBP, respectively, while the U‑Net‑based machine‑driven features attain 94% accuracy. Furthermore, a comparative analysis demonstrates superior segmentation quality for histopathological grades 1, 2, 3, and 4 using the U‑Net approach, as assessed by Jaccard and Dice metrics. This work underscores the utility of machine‑driven features in clinical applications that rely on automated pixel‑level segmentation in prostate tissue images.
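
The Jaccard and Dice metrics used to assess segmentation quality here can be computed directly from pixel overlap counts. The sketch below works on flattened binary masks; the function name and the convention of returning 1.0 when both masks are empty are assumptions, not details from the paper.

```python
def jaccard_and_dice(predicted, ground_truth):
    """Return (Jaccard, Dice) for two flat sequences of 0/1 pixel labels."""
    if len(predicted) != len(ground_truth):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(predicted, ground_truth) if p and t)
    predicted_area = sum(1 for p in predicted if p)
    truth_area = sum(1 for t in ground_truth if t)
    union = predicted_area + truth_area - intersection
    if union == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0, 1.0
    jaccard = intersection / union
    dice = 2 * intersection / (predicted_area + truth_area)
    return jaccard, dice


jaccard, dice = jaccard_and_dice([1, 1, 0, 0], [1, 0, 1, 0])
```

The two scores are monotonically related (Dice = 2J / (1 + J)), so they rank methods identically on a single mask pair even though their absolute values differ.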

Abstract:
Advanced Driver Assistance Systems (ADAS) based on deep neural networks (DNNs) are widely used in autonomous vehicles for critical perception tasks such as object detection, semantic segmentation, and lane recognition. However, these systems are highly sensitive to input variations, such as noise and changes in lighting, which can compromise their effectiveness and potentially lead to safety‑critical failures. This study offers a comprehensive empirical evaluation of image perturbations, techniques commonly used to assess the robustness of DNNs, to validate and improve the robustness and generalization of ADAS perception systems. We first conducted a systematic review of the literature, identifying 38 categories of perturbations. Next, we evaluated their effectiveness in revealing failures in two different ADAS, both at the component and at the system level. Finally, we explored the use of perturbation‑based data augmentation and continuous learning strategies to improve ADAS adaptation to new operational design domains. Our results demonstrate that all categories of image perturbations successfully expose robustness issues in ADAS and that the use of dataset augmentation and continuous learning significantly improves ADAS performance in novel, unseen environments.

Abstract:
Cross‑entropy (CE) loss is the de‑facto standard for training deep neural networks to perform classification. However, CE‑trained deep neural networks struggle with robustness and generalisation issues. To alleviate these issues, we propose high error margin (HEM) loss, a variant of multi‑class margin loss that overcomes the training issues of other margin‑based losses. We evaluate HEM extensively on a range of architectures and datasets. We find that HEM loss is more effective than cross‑entropy loss across a wide range of tasks: unknown class rejection, adversarial robustness, learning with imbalanced data, continual learning, and semantic segmentation (a pixel‑level classification task). Despite all training hyper‑parameters being chosen for CE loss, HEM is inferior to CE only in terms of clean accuracy and this difference is insignificant. We also compare HEM to specialised losses that have previously been proposed to improve performance on specific tasks. LogitNorm, a loss achieving state‑of‑the‑art performance on unknown class rejection, produces similar performance to HEM for this task, but is much poorer for continual learning and semantic segmentation. Logit‑adjusted loss, designed for imbalanced data, has superior results to HEM for that task, but performs more poorly on unknown class rejection and semantic segmentation. DICE, a popular loss for semantic segmentation, is inferior to HEM loss on all tasks, including semantic segmentation. Thus, HEM often out‑performs specialised losses, and in contrast to them, is a general‑purpose replacement for CE loss.
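
The abstract describes HEM as a variant of multi-class margin loss without giving its formula. For orientation, the classic multi-class margin (hinge) loss that it departs from is sketched below for a single sample. This is the standard baseline, not HEM itself; the function name and default margin are illustrative.

```python
def multiclass_margin_loss(logits, target, margin=1.0):
    """Classic multi-class margin loss for one sample.

    Each non-target logit is penalized when it comes within `margin` of the
    target logit; the sum is divided by the number of classes.
    """
    target_logit = logits[target]
    total = sum(max(0.0, margin - target_logit + logit)
                for index, logit in enumerate(logits)
                if index != target)
    return total / len(logits)


# Class 2 is only 0.5 below the target logit, so it violates the margin by 0.5.
loss = multiclass_margin_loss([2.0, 0.5, 1.5], target=0)
```

Unlike cross-entropy, this loss is exactly zero once every non-target logit is at least `margin` below the target logit, so well-separated samples stop contributing gradient.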

Abstract:
In this paper, we tackle the challenge of instance segmentation for foreign objects in chest radiographs, commonly seen in postoperative follow‑ups with stents, pacemakers, or ingested objects in children. The diversity of foreign objects complicates dense annotation, as shown in insufficient existing datasets. To address this, we propose the simple generation of synthetic data through (1) insertion of arbitrary shapes (lines, polygons, ellipses) with varying contrasts and opacities, and (2) cut‑paste augmentations from a small set of semi‑automatically extracted labels. These insertions are guided by anatomy labels to ensure realistic placements, such as stents appearing only in relevant vessels. Our approach enables networks to segment complex structures with minimal manually labeled data. Notably, it achieves performance comparable to fully supervised models while using 93% fewer manual annotations.
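
The first synthesis strategy, inserting simple shapes with varying contrast and opacity, can be sketched as alpha blending a shape into a grayscale image while recording its mask as the label. The rectangle shape, the function name, and all parameter values below are assumptions for illustration; the anatomy-guided placement described in the abstract is not modeled.

```python
def insert_rectangle(image, top, left, height, width, intensity, opacity):
    """Alpha-blend a filled rectangle into a grayscale image.

    Returns (augmented_image, mask), where mask marks the inserted pixels.
    The input image (a list of rows of floats) is left unmodified.
    """
    rows, cols = len(image), len(image[0])
    augmented = [row[:] for row in image]
    mask = [[0] * cols for _ in range(rows)]
    # Clip the rectangle to the image bounds.
    for y in range(max(top, 0), min(top + height, rows)):
        for x in range(max(left, 0), min(left + width, cols)):
            augmented[y][x] = (1 - opacity) * image[y][x] + opacity * intensity
            mask[y][x] = 1
    return augmented, mask


background = [[100.0] * 3 for _ in range(3)]
augmented, mask = insert_rectangle(background, top=1, left=1, height=1,
                                   width=2, intensity=200.0, opacity=0.5)
```

Because the mask is produced together with the image, every synthetic sample comes with a pixel-accurate label at no annotation cost.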

Abstract:
Structural integrity is vital for maintaining the safety and longevity of concrete infrastructures such as bridges, tunnels, and walls. Traditional methods for detecting damage such as cracks and spalls are labor‑intensive, time‑consuming, and prone to human error. To address these challenges, this study explores advanced data‑driven techniques using deep learning for automated damage detection and analysis. Two state‑of‑the‑art instance segmentation models, YOLO‑v7 instance segmentation and Mask R‑CNN, were evaluated using a dataset comprising 400 images, augmented to 10,995 images through geometric and color‑based transformations to enhance robustness. The models were trained and validated using a dataset split of 90% for training and 10% for validation and testing. Performance metrics such as precision, recall, mean average precision (mAP@0.5), and frames per second (FPS) were used for evaluation. YOLO‑v7 achieved a superior mAP@0.5 of 96.1% and processed 40 FPS, outperforming Mask R‑CNN, which achieved an mAP@0.5 of 92.1% with a slower processing speed of 18 FPS. The findings recommend the YOLO‑v7 instance segmentation model for real‑time, high‑speed structural health monitoring, while Mask R‑CNN is better suited for detailed offline assessments. This study demonstrates the potential of deep learning to revolutionize infrastructure maintenance, offering a scalable and efficient solution for automated damage detection.
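
The "@0.5" in mAP@0.5 refers to the intersection-over-union (IoU) threshold at which a prediction counts as a true positive. A minimal sketch of that matching test follows, using axis-aligned boxes for brevity; instance segmentation benchmarks may instead compute IoU on masks, and the function names here are illustrative.

```python
def box_iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def is_true_positive(predicted_box, truth_box, threshold=0.5):
    """A prediction matches a ground-truth box if their IoU reaches the threshold."""
    return box_iou(predicted_box, truth_box) >= threshold
```

For example, boxes (0, 0, 2, 2) and (1, 0, 3, 2) overlap by half of each box yet have an IoU of only 1/3, so the prediction would not count as a match at the 0.5 threshold.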

Abstract:
In this paper, an automatic labelling process is presented for automotive datasets, leveraging complementary information from LiDAR and camera. The generated labels are then used as ground truth with the corresponding 4D radar data as inputs to a proposed semantic segmentation network, to associate a class label with each spatial voxel. Promising results are shown by applying both approaches to the publicly shared RaDelft dataset, with the proposed network achieving over 65% of the LiDAR detection performance, improving vehicle detection probability by 13.2%, and reducing the Chamfer distance by 0.54 m, compared to variants inspired by the literature.

Abstract:
Phenotype segmentation is pivotal in analysing visual features of living organisms, enhancing our understanding of their characteristics. In the context of oysters, meat quality assessment is paramount, focusing on shell, meat, gonad, and muscle components. Traditional manual inspection methods are time‑consuming and subjective, prompting the adoption of machine vision technology for efficient and objective evaluation. We explore machine vision's capacity for segmenting oyster components, leading to the development of a multi‑network ensemble approach with a global‑local hierarchical attention mechanism. This approach integrates predictions from diverse models and addresses challenges posed by varying scales, ensuring robust instance segmentation across components. Finally, we provide a comprehensive evaluation of the proposed method's performance using different real‑world datasets, highlighting its efficacy and robustness in enhancing oyster phenotype segmentation.

Abstract:
Interest in leveraging Artificial Intelligence (AI) to automate the analysis of surgical procedures has surged in recent years. Video is one of the primary tools for recording surgical procedures and conducting subsequent analyses, such as performance assessment. However, these operative videos tend to be notably lengthy compared to those in other fields, spanning from thirty minutes to several hours, which makes it challenging for AI models to learn from them effectively. Despite this challenge, the foreseeable increase in the volume of such videos necessitates the development of innovative techniques to tackle this issue. In this article, we propose a novel technique called Kinematics Adaptive Frame Recognition (KAFR) that can efficiently eliminate redundant frames to reduce dataset size and computation time while retaining useful frames to improve accuracy. Specifically, we compute the similarity between consecutive frames by tracking the movement of surgical tools. Our approach follows these steps: i) Tracking phase: a YOLOv8 model is utilized to detect tools present in the scene; ii) Similarity phase: similarities between consecutive frames are computed by estimating variation in the spatial positions and velocities of the tools; iii) Classification phase: an X3D CNN is trained to classify segmentation. We evaluate the effectiveness of our approach by analyzing datasets obtained through retrospective reviews of cases at two referral centers. The newly annotated Gastrojejunostomy (GJ) dataset covers procedures performed between 2017 and 2021, while the previously annotated Pancreaticojejunostomy (PJ) dataset spans from 2011 to 2022 at the same centers.
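
The idea of dropping redundant frames based on tool movement can be sketched as follows. This is a simplified, hypothetical stand-in for the similarity phase: it uses only the position of a single tool center per frame (the abstract also mentions velocities), and the function name and the `min_shift` threshold are assumptions rather than the authors' settings.

```python
import math


def select_frames(tool_centers, min_shift=5.0):
    """Return indices of frames to keep.

    tool_centers: one (x, y) tool position per frame, e.g. a detector's box center.
    A frame is kept only if the tool has moved at least `min_shift` pixels
    since the last kept frame; otherwise it is treated as redundant.
    """
    if not tool_centers:
        return []
    kept = [0]  # Always keep the first frame as the reference.
    for index in range(1, len(tool_centers)):
        if math.dist(tool_centers[index], tool_centers[kept[-1]]) >= min_shift:
            kept.append(index)
    return kept


centers = [(0, 0), (1, 0), (2, 0), (10, 0), (10, 1), (20, 0)]
kept_indices = select_frames(centers)
```

Comparing against the last kept frame, rather than the immediately preceding one, prevents slow continuous motion from being discarded entirely.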

Abstract:
RGB and thermal image fusion has great potential to improve semantic segmentation in low‑illumination conditions. Existing methods typically employ a two‑branch encoder framework for multimodal feature extraction and design complicated feature fusion strategies to achieve feature extraction and fusion for multimodal semantic segmentation. However, these methods require massive parameter updates and computational effort during the feature extraction and fusion. To address this issue, we propose a novel multimodal fusion network (EFNet) based on an early fusion strategy and a simple but effective feature clustering for training efficient RGB‑T semantic segmentation. In addition, we also propose a lightweight and efficient multi‑scale feature aggregation decoder based on Euclidean distance. We validate the effectiveness of our method on different datasets and outperform previous state‑of‑the‑art methods with fewer parameters and less computation.

Abstract:
High‑resolution land cover mapping plays a crucial role in addressing a wide range of global challenges, including urban planning, environmental monitoring, disaster response, and sustainable development. However, creating accurate, large‑scale land cover datasets remains a significant challenge due to the inherent complexities of geospatial data, such as diverse terrain, varying sensor modalities, and atmospheric conditions. Synthetic Aperture Radar (SAR) imagery, with its ability to penetrate clouds and capture data in all‑weather, day‑and‑night conditions, offers unique advantages for land cover mapping. Despite these strengths, the lack of benchmark datasets tailored for SAR imagery has limited the development of robust models specifically designed for this data modality. To bridge this gap and facilitate advancements in SAR‑based geospatial analysis, we introduce OpenEarthMap‑SAR, a benchmark SAR dataset for global high‑resolution land cover mapping. OpenEarthMap‑SAR consists of 1.5 million segments of 5033 aerial and satellite images with the size of 1024×1024 pixels, covering 35 regions from Japan, France, and the USA, with partially manually annotated and fully pseudo 8‑class land cover labels at a ground sampling distance of 0.15‑0.5 m. We evaluated the performance of state‑of‑the‑art methods for semantic segmentation and present challenging problem settings suitable for further technical development. The dataset also serves as the official dataset for IEEE GRSS Data Fusion Contest Track I. The dataset has been made publicly available at https://zenodo.org/records/14622048.

Abstract:
Augmentation by generative modelling yields a promising alternative to the accumulation of surgical data, where ethical, organisational and regulatory aspects must be considered. Yet, the joint synthesis of (image, mask) pairs for segmentation, a major application in surgery, is rather unexplored. We propose to learn semantically comprehensive yet compact latent representations of the (image, mask) space, which we jointly model with a Latent Diffusion Model. We show that our approach can effectively synthesise unseen high‑quality paired segmentation data of remarkable semantic coherence. Generative augmentation is typically applied pre‑training by synthesising a fixed number of additional training samples to improve downstream task models. To enhance this approach, we further propose Generative Adaptive Uncertainty‑guided Diffusion‑based Augmentation (GAUDA), leveraging the epistemic uncertainty of a Bayesian downstream model for targeted online synthesis. We condition the generative model on classes with high estimated uncertainty during training to produce additional unseen samples for these classes. By adaptively utilising the generative model online, we can minimise the number of additional training samples and centre them around the currently most uncertain parts of the data distribution. GAUDA effectively improves downstream segmentation results over comparable methods by an average absolute IoU of 1.6% on CaDISv2 and 1.5% on CholecSeg8k, two prominent surgical datasets for semantic segmentation.

Abstract:
Semi‑supervised learning offers an appealing solution for remote sensing (RS) image segmentation to relieve the burden of labor‑intensive pixel‑level labeling. However, RS images pose unique challenges, including rich multi‑scale features and high inter‑class similarity. To address these problems, this paper proposes a novel semi‑supervised Multi‑Scale Uncertainty and Cross‑Teacher‑Student Attention (MUCA) model for RS image semantic segmentation tasks. Specifically, MUCA constrains the consistency among feature maps at different layers of the network by introducing a multi‑scale uncertainty consistency regularization. It improves the multi‑scale learning capability of semi‑supervised algorithms on unlabeled data. Additionally, MUCA utilizes a Cross‑Teacher‑Student attention mechanism that guides the student network to construct more discriminative feature representations through complementary features from the teacher network. This design effectively integrates weak and strong augmentations (WA and SA) to further boost segmentation performance. To verify the effectiveness of our model, we conduct extensive experiments on ISPRS‑Potsdam and LoveDA datasets. The experimental results show the superiority of our method over state‑of‑the‑art semi‑supervised methods. Notably, our model excels in distinguishing highly similar objects, showcasing its potential for advancing semi‑supervised RS image segmentation tasks.

Abstract:
Convolutional Neural Networks (CNN) and Vision Transformers (ViT) have dominated the field of Computer Vision (CV). Graph Neural Networks (GNN) have performed remarkably well across diverse domains because they can represent complex relationships via unstructured graphs. However, the applicability of GNNs for visual tasks was unexplored till the introduction of Vision GNNs (ViG). Despite the success of ViGs, their performance is severely bottlenecked due to the expensive k‑Nearest Neighbors (k‑NN) based graph construction. Recent works addressing this bottleneck impose constraints on the flexibility of GNNs to build unstructured graphs, undermining their core advantage while introducing additional inefficiencies. To address these issues, in this paper, we propose a novel method called Dynamic Efficient Graph Convolution (DEGC) for designing efficient and globally aware ViGs. DEGC partitions the input image and constructs graphs in parallel for each partition, improving graph construction efficiency. Further, DEGC integrates local intra‑graph and global inter‑graph feature learning, enabling enhanced global context awareness. Using DEGC as a building block, we propose a novel CNN‑GNN architecture, ClusterViG, for CV tasks. Extensive experiments indicate that ClusterViG reduces end‑to‑end inference latency for vision tasks by up to 5× when compared against a suite of models such as ViG, ViHGNN, PVG, and GreedyViG, with a similar model parameter count. Additionally, ClusterViG reaches state‑of‑the‑art performance on image classification, object detection, and instance segmentation tasks, demonstrating the effectiveness of the proposed globally aware learning strategy. Finally, input partitioning performed by DEGC enables ClusterViG to be trained efficiently on higher‑resolution images, underscoring the scalability of our approach.

Abstract:
Light‑weight neural networks for remote sensing (RS) visual analysis must overcome two inherent redundancies: spatial redundancy from vast, homogeneous backgrounds, and channel redundancy, where extreme scale variations render a single feature space inefficient. Existing models, often designed for natural images, fail to address this dual challenge in RS scenarios. To bridge this gap, we propose LWGANet, a light‑weight backbone engineered for RS‑specific properties. LWGANet introduces two core innovations: a Top‑K Global Feature Interaction (TGFI) module that mitigates spatial redundancy by focusing computation on salient regions, and a Light‑Weight Grouped Attention (LWGA) module that resolves channel redundancy by partitioning channels into specialized, scale‑specific pathways. By synergistically resolving these core inefficiencies, LWGANet achieves a superior trade‑off between feature representation quality and computational cost. Extensive experiments on twelve diverse datasets across four major RS tasks (scene classification, oriented object detection, semantic segmentation, and change detection) demonstrate that LWGANet consistently outperforms state‑of‑the‑art light‑weight backbones in both accuracy and efficiency. Our work establishes a new, robust baseline for efficient visual analysis in RS images.

Abstract:
LiDAR is a crucial sensor in autonomous driving, commonly used alongside cameras. By exploiting this camera‑LiDAR setup and recent advances in image representation learning, prior studies have shown the promising potential of image‑to‑LiDAR distillation. These prior works focus on designing their own losses to effectively distill the pre‑trained 2D image representations into a 3D model. However, the other parts of the designs have been surprisingly unexplored. We find that fundamental design elements overlooked in prior works, e.g., the LiDAR coordinate system, quantization according to the existing input interface, and data utilization, are more critical than developing loss functions. In this work, we show that simple fixes to these designs notably outperform existing methods by 16% in 3D semantic segmentation on the nuScenes dataset and 13% in 3D object detection on the KITTI dataset in downstream task performance. We focus on overlooked design choices along the spatial and temporal axes. Spatially, prior work has used cylindrical coordinates and voxel sizes without considering the side effects they produce with the commonly deployed sparse convolution layer input interface, leading to spatial quantization errors in 3D models. Temporally, existing work has avoided cumbersome data curation by discarding unsynced data, limiting the use to only the small portion of data that is temporally synced across sensors. We analyze these effects and propose simple solutions for each overlooked aspect.

Abstract:
Small object segmentation, such as tumor segmentation, is a difficult and critical task in the field of medical image analysis. Although deep learning‑based methods have achieved promising performance, they are restricted to the use of binary segmentation masks. Inspired by the rigorous mapping between a binary segmentation mask and its distance map, we adopt the distance map as a novel ground truth and employ a network to compute it. Specifically, we propose a new segmentation framework that incorporates an existing binary segmentation network and a lightweight regression network (dubbed LR‑Net). The LR‑Net thus converts the distance map computation into a regression task and leverages the rich information of distance maps. Additionally, we derive a shape‑aware loss by employing distance maps as penalty maps to infer the complete shape of an object. We evaluated our approach on the MICCAI 2017 Liver Tumor Segmentation (LiTS) Challenge dataset and a clinical dataset. Experimental results show that our approach outperforms classification‑based methods as well as other existing state‑of‑the‑art methods.
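
The mapping between a binary mask and a distance map can be made concrete with a small distance transform. The sketch below uses a multi-source breadth-first search and the city-block (4-connected) metric for simplicity; the abstract does not state which metric the authors use, so that choice and the function name are assumptions.

```python
from collections import deque


def distance_map(mask):
    """City-block distance from every pixel to the nearest background (0) pixel.

    mask: list of rows of 0/1 values. Background pixels get distance 0.
    If the mask contains no background at all, every entry stays None.
    """
    rows, cols = len(mask), len(mask[0])
    dist = [[0 if mask[y][x] == 0 else None for x in range(cols)]
            for y in range(rows)]
    # Start the search from all background pixels at once.
    queue = deque((y, x) for y in range(rows) for x in range(cols)
                  if mask[y][x] == 0)
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < rows and 0 <= nx < cols and dist[ny][nx] is None:
                dist[ny][nx] = dist[y][x] + 1
                queue.append((ny, nx))
    return dist


tumor_mask = [[0, 0, 0, 0, 0],
              [0, 1, 1, 1, 0],
              [0, 1, 1, 1, 0],
              [0, 1, 1, 1, 0],
              [0, 0, 0, 0, 0]]
tumor_distances = distance_map(tumor_mask)
```

Thresholding the result at `distance > 0` recovers the original mask exactly, while the distance values additionally encode how deep each pixel lies inside the object, which is the extra shape information a regression target can exploit.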

Abstract:
Remote sensing imagery is dense with objects and contextual visual information. There is a recent trend to combine paired satellite images and text captions for pretraining performant encoders for downstream tasks. However, while contrastive image‑text methods like CLIP enable vision‑language alignment and zero‑shot classification ability, vision‑only downstream performance tends to degrade compared to image‑only pretraining, such as MAE. In this paper, we propose FLAVARS, a pretraining method that combines the best of both contrastive learning and masked modeling, along with geospatial alignment via contrastive location encoding. We find that FLAVARS significantly outperforms a baseline of SkyCLIP for vision‑only tasks such as KNN classification and semantic segmentation, +6% mIOU on SpaceNet1, while retaining the ability to perform zero‑shot classification, unlike MAE pretrained methods.

Abstract:
Semantic future prediction is important for autonomous systems navigating dynamic environments. This paper introduces FUTURIST, a method for multimodal future semantic prediction that uses a unified and efficient visual sequence transformer architecture. Our approach incorporates a multimodal masked visual modeling objective and a novel masking mechanism designed for multimodal training. This allows the model to effectively integrate visible information from various modalities, improving prediction accuracy. Additionally, we propose a VAE‑free hierarchical tokenization process, which reduces computational complexity, streamlines the training pipeline, and enables end‑to‑end training with high‑resolution, multimodal inputs. We validate FUTURIST on the Cityscapes dataset, demonstrating state‑of‑the‑art performance in future semantic segmentation for both short‑ and mid‑term forecasting. Project page and code at https://futurist‑cvpr2025.github.io/ .

Abstract:
While recent foundation models have enabled significant breakthroughs in monocular depth estimation, a clear path towards safe and reliable deployment in the real‑world remains elusive. Metric depth estimation, which involves predicting absolute distances, poses particular challenges, as even the most advanced foundation models remain prone to critical errors. Since quantifying the uncertainty has emerged as a promising endeavor to address these limitations and enable trustworthy deployment, we fuse five different uncertainty quantification methods with the current state‑of‑the‑art DepthAnythingV2 foundation model. To cover a wide range of metric depth domains, we evaluate their performance on four diverse datasets. Our findings identify fine‑tuning with the Gaussian Negative Log‑Likelihood Loss (GNLL) as a particularly promising approach, offering reliable uncertainty estimates while maintaining predictive performance and computational efficiency on par with the baseline, encompassing both training and inference time. By fusing uncertainty quantification and foundation models within the context of monocular depth estimation, this paper lays a critical foundation for future research aimed at improving not only model performance but also its explainability. Extending this critical synthesis of uncertainty quantification and foundation models into other crucial tasks, such as semantic segmentation and pose estimation, presents exciting opportunities for safer and more reliable machine vision systems.
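
The Gaussian negative log-likelihood objective highlighted above has a compact per-pixel form: the network predicts a mean and a variance, and the loss trades off the squared error against the predicted variance. A minimal sketch follows, with the constant term 0.5 * log(2 * pi) omitted; the function name and the `eps` clamp are illustrative choices.

```python
import math


def gaussian_nll(mean, variance, target, eps=1e-6):
    """Per-pixel Gaussian negative log-likelihood, up to an additive constant.

    mean, variance: predicted depth and predicted variance for one pixel.
    target: ground-truth depth. The variance is clamped to eps for stability.
    """
    variance = max(variance, eps)
    return 0.5 * (math.log(variance) + (target - mean) ** 2 / variance)


confident_wrong = gaussian_nll(mean=0.0, variance=1.0, target=2.0)
uncertain_wrong = gaussian_nll(mean=0.0, variance=4.0, target=2.0)
```

For the same 2 m error, the prediction that reports a larger variance incurs a lower loss, which is what encourages the model to raise its uncertainty where it tends to be wrong; the log-variance term keeps it from inflating the variance everywhere.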

Abstract:
Semantic segmentation of remote sensing images is essential for various applications, including vegetation monitoring, disaster management, and urban planning. Previous studies have demonstrated that the self‑attention mechanism (SA) is an effective approach for designing segmentation networks that can capture long‑range pixel dependencies. SA enables the network to model the global dependencies between the input features, resulting in improved segmentation outcomes. However, the high density of attentional feature maps used in this mechanism causes exponential increases in computational complexity. Additionally, it introduces redundant information that negatively impacts the feature representation. Inspired by traditional threshold segmentation algorithms, we propose a novel threshold attention mechanism (TAM). This mechanism significantly reduces computational effort while also better modeling the correlation between different regions of the feature map. Based on TAM, we present a threshold attention network (TANet) for semantic segmentation. TANet consists of an attentional feature enhancement module (AFEM) for global feature enhancement of shallow features and a threshold attention pyramid pooling module (TAPP) for acquiring feature information at different scales for deep features. We have conducted extensive experiments on the ISPRS Vaihingen and Potsdam datasets. The results demonstrate the validity and superiority of our proposed TANet compared to most state‑of‑the‑art models.

Abstract:
Knowledge distillation has been widely adopted in computer vision tasks, since it can effectively enhance the performance of lightweight student networks by leveraging the knowledge transferred from cumbersome teacher networks. Most existing knowledge distillation methods utilize Kullback‑Leibler divergence to mimic the logit output probabilities between the teacher network and the student network. Nonetheless, these methods may neglect the negative parts of the teacher's "dark knowledge" because the divergence calculations may ignore the effect of the minute probabilities from the teacher's logit output. This deficiency may lead to suboptimal performance in logit mimicry during the distillation process and result in an imbalance of information acquired by the student network. In this paper, we investigate the impact of this imbalance and propose a novel method, named Balance Divergence Distillation. By introducing a compensatory operation using reverse Kullback‑Leibler divergence, our method can improve the modeling of the extremely small values in the negative parts of the teacher's output while preserving the learning capacity for the positive parts. Furthermore, we test the impact of different temperature coefficient adjustments, which may be used to further balance the knowledge transfer. We evaluate the proposed method on several computer vision tasks, including image classification and semantic segmentation. The evaluation results show that our method achieves an accuracy improvement of 1% to 3% for lightweight students on both the CIFAR‑100 and ImageNet datasets, and a 4.55% improvement in mIoU for PSP‑ResNet18 on the Cityscapes dataset. The experiments show that our method is a simple yet highly effective solution that can be smoothly applied to different knowledge distillation methods.
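The idea of compensating forward KL divergence with a reverse KL term can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the exact loss, so the mixing weight `alpha`, the separate temperatures `t_fwd` and `t_rev`, and all function names are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def balanced_divergence(teacher_logits, student_logits,
                        t_fwd=4.0, t_rev=1.0, alpha=0.5):
    """Weighted sum of forward KL(teacher || student) and reverse KL(student || teacher).

    alpha, t_fwd and t_rev are assumed hyperparameters, not values from the paper.
    """
    forward = kl(softmax(teacher_logits, t_fwd), softmax(student_logits, t_fwd))
    reverse = kl(softmax(student_logits, t_rev), softmax(teacher_logits, t_rev))
    return alpha * forward + (1.0 - alpha) * reverse

loss_same = balanced_divergence([2.0, 1.0, 0.0], [2.0, 1.0, 0.0])
loss_diff = balanced_divergence([2.0, 1.0, 0.0], [0.0, 1.0, 2.0])
```

Forward KL weights each class by the teacher probability, so classes the teacher considers very unlikely contribute little. Reverse KL weights by the student probability, so it penalizes the student for placing mass where the teacher places almost none.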

Abstract:
Semantic segmentation plays a crucial role in remote sensing applications, where the accurate extraction and representation of features are essential for high‑quality results. Despite the widespread use of encoder‑decoder architectures, existing methods often struggle with fully utilizing the high‑dimensional features extracted by the encoder and efficiently recovering detailed information during decoding. To address these problems, we propose a novel semantic segmentation network, namely DeepKANSeg, including two key innovations based on the emerging Kolmogorov‑Arnold Network (KAN). Notably, the advantage of KAN lies in its ability to decompose high‑dimensional complex functions into univariate transformations, enabling efficient and flexible representation of intricate relationships in data. First, we introduce a KAN‑based deep feature refinement module, namely DeepKAN, to effectively capture complex spatial and rich semantic relationships from high‑dimensional features. Second, we replace the traditional multi‑layer perceptron (MLP) layers in the global‑local combined decoder with KAN‑based linear layers, namely GLKAN. This module enhances the decoder's ability to capture fine‑grained details during decoding. To evaluate the effectiveness of the proposed method, experiments are conducted on two well‑known fine‑resolution remote sensing benchmark datasets, namely ISPRS Vaihingen and ISPRS Potsdam. The results demonstrate that the KAN‑enhanced segmentation model achieves superior accuracy compared to state‑of‑the‑art methods. They highlight the potential of KANs as a powerful alternative to traditional architectures in semantic segmentation tasks. Moreover, the explicit univariate decomposition provides improved interpretability, which is particularly beneficial for applications requiring explainable learning in remote sensing.

Abstract:
This paper addresses the domain adaptation challenge for semantic segmentation in medical imaging. Despite the impressive performance of recent foundational segmentation models like SAM on natural images, they struggle with medical domain images. Beyond this, recent approaches that perform end‑to‑end fine‑tuning of models are simply not computationally tractable. To address this, we propose a novel SAM adapter approach that minimizes the number of trainable parameters while achieving comparable performances to full fine‑tuning. The proposed SAM adapter is strategically placed in the mask decoder, offering excellent and broad generalization capabilities and improved segmentation across both fully supervised and test‑time domain adaptation tasks. Extensive validation on four datasets showcases the adapter's efficacy, outperforming existing methods while training less than 1% of SAM's total parameters.

Abstract:
We study image segmentation in the biological domain, particularly trait segmentation from specimen images (e.g., butterfly wing stripes, beetle elytra). This fine‑grained task is crucial for understanding the biology of organisms, but it traditionally requires manually annotating segmentation masks for hundreds of images per species, making it highly labor‑intensive. To address this challenge, we propose a label‑efficient approach, Static Segmentation by Tracking (SST), based on a key insight: while specimens of the same species exhibit natural variation, the traits of interest show up consistently. This motivates us to concatenate specimen images into a "pseudo‑video" and reframe trait segmentation as a tracking problem. Specifically, SST generates masks for unlabeled images by propagating annotated or predicted masks from the "pseudo‑preceding" images. Built upon recent video segmentation models, such as Segment Anything Model 2, SST achieves high‑quality trait segmentation with only one labeled image per species, marking a breakthrough in specimen image analysis. To further enhance segmentation quality, we introduce a cycle‑consistent loss for fine‑tuning, again requiring only one labeled image. Additionally, we demonstrate the broader potential of SST, including one‑shot instance segmentation in natural images and trait‑based image retrieval.

Abstract:
This paper addresses the challenge of parking space detection in urban areas, focusing on the city of Granada. Utilizing aerial imagery, we develop and apply semantic segmentation techniques to accurately identify parked cars, moving cars and roads. A significant aspect of our research is the creation of a proprietary dataset specific to Granada, which is instrumental in training our neural network model. We employ Fully Convolutional Networks, Pyramid Networks and Dilated Convolutions, demonstrating their effectiveness in urban semantic segmentation. Our approach involves comparative analysis and optimization of various models, including Dynamic U‑Net, PSPNet and DeepLabV3+, tailored for the segmentation of aerial images. The study includes a thorough experimentation phase, using datasets such as UDD5 and UAVid, alongside our custom Granada dataset. We evaluate our models using metrics like Foreground Accuracy, Dice Coefficient and Jaccard Index. Our results indicate that DeepLabV3+ offers the most promising performance. We conclude with future directions, emphasizing the need for a dedicated neural network for parked car detection and the potential for application in other urban environments. This work contributes to the fields of urban planning and traffic management, providing insights into efficient utilization of parking spaces through advanced image processing techniques.
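For reference, the Dice Coefficient and Jaccard Index named above can be computed for a binary mask as in the following sketch. The flat-list mask layout and the convention of returning 1.0 when both masks are empty are assumptions made for this example.

```python
def dice_and_jaccard(pred, target):
    """Return (Dice, Jaccard) for two binary masks given as flat lists of 0/1."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    pred_sum = sum(1 for p in pred if p)
    target_sum = sum(1 for t in target if t)
    union = pred_sum + target_sum - inter
    # Convention (assumed): two empty masks count as a perfect match.
    dice = 2 * inter / (pred_sum + target_sum) if (pred_sum + target_sum) else 1.0
    jaccard = inter / union if union else 1.0
    return dice, jaccard

dice, jaccard = dice_and_jaccard([1, 1, 0, 0], [1, 0, 1, 0])
```

The two metrics are monotonically related by J = D / (2 - D), so they rank models identically on a single mask, although their averages over many images can differ.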

Abstract:
Current approaches to dichotomous image segmentation (DIS) treat image matting and object segmentation as fundamentally different tasks. As improvements in image segmentation become increasingly challenging to achieve, combining image matting and grayscale segmentation techniques offers promising new directions for architectural innovation. Inspired by the possibility of aligning these two model tasks, we propose a new architectural approach for DIS called Confidence‑Guided Matting (CGM). We created the first CGM model called Background Erase Network (BEN). BEN consists of two components: BEN Base for initial segmentation and BEN Refiner for confidence‑based refinement. Our approach achieves substantial improvements over current state‑of‑the‑art methods on the DIS5K validation dataset, demonstrating that matting‑based refinement can significantly enhance segmentation quality. This work introduces a new paradigm for integrating matting and segmentation techniques, improving fine‑grained object boundary prediction in computer vision.

Abstract:
3D semantic segmentation is one of the most crucial tasks in driving perception. The ability of a learning‑based model to accurately perceive dense 3D surroundings often ensures the safe operation of autonomous vehicles. However, existing LiDAR‑based 3D semantic segmentation databases consist of sequentially acquired LiDAR scans that are long‑tailed and lack training diversity. In this report, we introduce MixSeg3D, a sophisticated combination of the strong point cloud segmentation model with advanced 3D data mixing strategies. Specifically, our approach integrates the MinkUNet family with LaserMix and PolarMix, two scene‑scale data augmentation methods that blend LiDAR point clouds along the ego‑scene's inclination and azimuth directions. Through empirical experiments, we demonstrate the superiority of MixSeg3D over the baseline and prior arts. Our team achieved 2nd place in the 3D semantic segmentation track of the 2024 Waymo Open Dataset Challenge.
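A rough sketch of mixing two LiDAR scans along the inclination direction is given below. It is a simplified illustration in the spirit of the mixing strategies described above, not the challenge implementation: the point layout `(x, y, z, label)`, the fixed angular range, and the rule "even bands from scan A, odd bands from scan B" are assumptions for this example.

```python
import math

def inclination(point):
    """Inclination angle of a point (x, y, z, ...) relative to the sensor's horizontal plane."""
    x, y, z = point[0], point[1], point[2]
    return math.atan2(z, math.hypot(x, y))

def band_index(point, phi_min, phi_max, num_bands):
    """Index of the inclination band containing the point, clamped to the valid range."""
    frac = (inclination(point) - phi_min) / (phi_max - phi_min)
    return min(num_bands - 1, max(0, int(frac * num_bands)))

def mix_by_inclination(scan_a, scan_b, phi_min, phi_max, num_bands):
    """Build a mixed scan: even-indexed bands come from scan_a, odd-indexed bands from scan_b."""
    mixed = [p for p in scan_a if band_index(p, phi_min, phi_max, num_bands) % 2 == 0]
    mixed += [p for p in scan_b if band_index(p, phi_min, phi_max, num_bands) % 2 == 1]
    return mixed

# Toy scans: each point is (x, y, z, label); labels travel with their points.
scan_a = [(1.0, 0.0, -0.5, "a_low"), (1.0, 0.0, 0.5, "a_high")]
scan_b = [(1.0, 0.0, -0.5, "b_low"), (1.0, 0.0, 0.5, "b_high")]
mixed = mix_by_inclination(scan_a, scan_b, -math.pi / 4, math.pi / 4, 2)
```

Because labels are carried with the points, the mixed scan comes with a consistent mixed ground truth at no extra annotation cost.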

Abstract:
Semantic segmentation for autonomous driving is an even more challenging task when faced with adverse driving conditions. Standard models trained on data recorded under ideal conditions show a deteriorated performance in unfavorable weather or illumination conditions. Fine‑tuning on the new task or condition would lead to overwriting the previously learned information resulting in catastrophic forgetting. Adapting to the new conditions through traditional domain adaption methods improves the performance on the target domain at the expense of the source domain. Addressing these issues, we propose an architecture‑based domain‑incremental learning approach called Progressive Semantic Segmentation (PSS). PSS is a task‑agnostic, dynamically growing collection of domain‑specific segmentation models. The task of inferring the domain and subsequently selecting the appropriate module for segmentation is carried out using a collection of convolutional autoencoders. We extensively evaluate our proposed approach using several datasets at varying levels of granularity in the categorization of adverse driving conditions. Furthermore, we demonstrate the generalization of the proposed approach to similar and unseen domains.
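One plausible way to infer the domain with a collection of autoencoders is to route each input to the domain whose autoencoder reconstructs it best. The sketch below illustrates this selection rule only. The abstract does not state the exact criterion, so the lowest-reconstruction-error rule, the toy "autoencoders", and all names are assumptions.

```python
def mse(a, b):
    """Mean squared error between two equally sized flat lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def select_domain(image, autoencoders):
    """Return (best_domain, errors): the domain whose autoencoder reconstructs the image best."""
    errors = {name: mse(image, ae(image)) for name, ae in autoencoders.items()}
    best = min(errors, key=errors.get)
    return best, errors

# Toy stand-ins for trained autoencoders: each can only reproduce
# intensities inside the range typical of its own domain.
autoencoders = {
    "day": lambda img: [min(1.0, max(0.5, v)) for v in img],
    "night": lambda img: [min(0.5, max(0.0, v)) for v in img],
}

best, errors = select_domain([0.1, 0.2, 0.3], autoencoders)
# The segmentation module registered for `best` would then process the image.
```

Adding a new domain under this scheme means registering one more autoencoder and one more segmentation module, leaving the existing ones untouched, which is what avoids overwriting previously learned domains.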

Abstract:
We present Seg‑TTO, a novel framework for zero‑shot, open‑vocabulary semantic segmentation (OVSS), designed to excel in specialized domain tasks. While current open‑vocabulary approaches show impressive performance on standard segmentation benchmarks under zero‑shot settings, they fall short of supervised counterparts on highly domain‑specific datasets. We focus on segmentation‑specific test‑time optimization to address this gap. Segmentation requires an understanding of multiple concepts within a single image while retaining the locality and spatial structure of representations. We propose a novel self‑supervised objective adhering to these requirements and use it to align the model parameters with input images at test time. In the textual modality, we learn multiple embeddings for each category to capture diverse concepts within an image, while in the visual modality, we calculate pixel‑level losses followed by embedding aggregation operations specific to preserving spatial structure. Our resulting framework termed Seg‑TTO is a plug‑and‑play module. We integrate Seg‑TTO with three state‑of‑the‑art OVSS approaches and evaluate across 22 challenging OVSS tasks covering a range of specialized domains. Our Seg‑TTO demonstrates clear performance improvements (up to 27% mIoU increase on some datasets) establishing new state‑of‑the‑art. Our code and models will be released publicly.

Abstract:
Despite widespread adoption of deep learning models to address a variety of computer vision tasks, planetary science has yet to see extensive utilization of such tools to address its unique problems. On Titan, the largest moon of Saturn, tracking seasonal trends and weather patterns of clouds provides crucial insights into one of the most complex climates in the Solar System, yet much of the available image data are still analyzed in a conventional way. In this work, we apply a Mask R‑CNN trained via transfer learning to perform instance segmentation of clouds in Titan images acquired by the Cassini spacecraft, a previously unexplored approach to a big data problem in planetary science. We demonstrate that an automated technique can provide quantitative measures for clouds, such as areas and centroids, that may otherwise be prohibitively time‑intensive to produce by human mapping. Furthermore, despite Titan‑specific challenges, our approach yields accuracy comparable to contemporary cloud identification studies on Earth and other worlds. We compare the efficiencies of human‑driven versus algorithmic approaches, showing that transfer learning provides speed‑ups that may open new horizons for data investigation for Titan. Moreover, we suggest that such approaches have broad potential for application to similar problems in planetary science where they are currently under‑utilized. Future planned missions to the planets and remote sensing initiatives for the Earth promise to provide a deluge of image data in the coming years that will benefit strongly from leveraging machine learning approaches to perform the analysis.

Abstract:
With the rapid advancement of deep learning, computational pathology has made significant progress in cancer diagnosis and subtyping. Tissue segmentation is a core challenge, essential for prognosis and treatment decisions. Weakly supervised semantic segmentation (WSSS) reduces the annotation requirement by using image‑level labels instead of pixel‑level ones. However, Class Activation Map (CAM)‑based methods still suffer from low spatial resolution and unclear boundaries. To address these issues, we propose a multi‑level superpixel correction algorithm that refines CAM boundaries using superpixel clustering and floodfill. Experimental results show that our method achieves an mIoU of 71.08% on a breast cancer segmentation dataset, significantly improving tumor microenvironment boundary delineation.
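To make the idea of superpixel-based CAM correction concrete, the sketch below snaps coarse CAM labels to superpixel boundaries by majority vote inside each superpixel. It is a single-level simplification: the multi-level scheme and the floodfill step of the proposed algorithm are not described in the abstract and are not reproduced here. The function name and flat-list layout are illustrative.

```python
from collections import Counter, defaultdict

def refine_with_superpixels(cam_labels, superpixel_ids):
    """Assign every pixel the majority CAM label of its superpixel.

    cam_labels and superpixel_ids are flat lists of equal length (assumed layout).
    """
    votes = defaultdict(Counter)
    for label, sp in zip(cam_labels, superpixel_ids):
        votes[sp][label] += 1
    majority = {sp: counts.most_common(1)[0][0] for sp, counts in votes.items()}
    return [majority[sp] for sp in superpixel_ids]

# Six pixels in two superpixels; one pixel in each disagrees with its neighbors.
refined = refine_with_superpixels([1, 1, 0, 0, 0, 1], [0, 0, 0, 1, 1, 1])
```

Superpixels follow image edges, so voting within them transfers their sharper boundaries to the blurry, low-resolution CAM output.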

Abstract:
Automated fish documentation processes are in the near future expected to play an essential role in sustainable fisheries management and for addressing challenges of overfishing. In this paper, we present a novel and publicly available dataset named AutoFish designed for fine‑grained fish analysis. The dataset comprises 1,500 images of 454 specimens of visually similar fish placed in various constellations on a white conveyor belt and annotated with instance segmentation masks, IDs, and length measurements. The data was collected in a controlled environment using an RGB camera. The annotation procedure involved manual point annotations, initial segmentation masks proposed by the Segment Anything Model (SAM), and subsequent manual correction of the masks. We establish baseline instance segmentation results using two variations of the Mask2Former architecture, with the best performing model reaching an mAP of 89.15%. Additionally, we present two baseline length estimation methods, the best performing being a custom MobileNetV2‑based regression model reaching an MAE of 0.62cm in images with no occlusion and 1.38cm in images with occlusion. Link to project page: https://vap.aau.dk/autofish/.

Abstract:
This study explores the potential of graph neural networks (GNNs) to enhance semantic segmentation across diverse image modalities. We evaluate the effectiveness of a novel GNN‑based U‑Net architecture on three distinct datasets: PascalVOC, a standard benchmark for natural image segmentation; WoodScape, a challenging dataset of fisheye images commonly used in autonomous driving that introduces significant geometric distortions; and ISIC2016, a dataset of dermoscopic images for skin lesion segmentation. We compare our proposed UNet‑GNN model against established convolutional neural network (CNN)‑based segmentation models, including U‑Net and U‑Net++, as well as the transformer‑based SwinUNet. Unlike these methods, which primarily rely on local convolutional operations or global self‑attention, GNNs explicitly model relationships between image regions by constructing and operating on a graph representation of the image features. This approach allows the model to capture long‑range dependencies and complex spatial relationships, which we hypothesize will be particularly beneficial for handling geometric distortions present in fisheye imagery and capturing intricate boundaries in medical images. Our analysis demonstrates the versatility of GNNs in addressing diverse segmentation challenges and highlights their potential to improve segmentation accuracy in various applications, including autonomous driving and medical image analysis.

Abstract:
This paper presents Camera‑LiDAR Fusion Transformer (CLFT) models for traffic object segmentation, which leverage the fusion of camera and LiDAR data using vision transformers. Building on the methodology of visual transformers that exploit the self‑attention mechanism, we extend segmentation capabilities with additional classification options to a diverse class of objects including cyclists, traffic signs, and pedestrians across diverse weather conditions. Despite good performance, the models face challenges under adverse conditions which underscores the need for further optimization to enhance performance in darkness and rain. In summary, the CLFT models offer a compelling solution for autonomous driving perception, advancing the state‑of‑the‑art in multimodal fusion and object segmentation, with ongoing efforts required to address existing limitations and fully harness their potential in practical deployments.

Abstract:
Semantic segmentation is a computer vision task where classification is performed at a pixel level. Due to this, the process of labeling images for semantic segmentation is time‑consuming and expensive. To mitigate this cost there has been a surge in the use of synthetically generated data (usually created using simulators or videogames) which, in combination with domain adaptation methods, can effectively learn how to segment real data. Still, these datasets have a particular limitation: due to their closed‑set nature, it is not possible to include novel classes without modifying the tool used to generate them, which is often not public. Concurrently, generative models have made remarkable progress, particularly with the introduction of diffusion models, enabling the creation of high‑quality images from text prompts without additional supervision. In this work, we propose an unsupervised pipeline that leverages Stable Diffusion and the Segment Anything Model to generate class examples with an associated segmentation mask, and a method to integrate generated cutouts for novel classes in semantic segmentation datasets, all with minimal user input. Our approach aims to improve the performance of unsupervised domain adaptation methods by introducing novel samples into the training data without modifications to the underlying algorithms. With our methods, we show how models can not only effectively learn how to segment novel classes, with an average performance of 51% IoU, but also reduce errors for other, already existing classes, reaching a higher performance level overall.
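The integration of a generated cutout into an existing training sample can be sketched as a masked paste that updates the image and its label map together. This is a minimal illustration, assuming images and label maps are 2D lists and the mask is binary; the function name, its arguments, and the placement policy are hypothetical, and the paper's actual integration method may differ.

```python
def paste_cutout(image, labels, cutout, mask, class_id, top, left):
    """Paste `cutout` where `mask` is 1, writing `class_id` into the label map.

    Returns new (image, labels); inputs are left unmodified.
    Pixels falling outside the image bounds are skipped.
    """
    out_img = [row[:] for row in image]
    out_lab = [row[:] for row in labels]
    height, width = len(out_img), len(out_img[0])
    for r, mask_row in enumerate(mask):
        for c, inside in enumerate(mask_row):
            rr, cc = top + r, left + c
            if inside and 0 <= rr < height and 0 <= cc < width:
                out_img[rr][cc] = cutout[r][c]
                out_lab[rr][cc] = class_id
    return out_img, out_lab

image = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
labels = [[7, 7, 7], [7, 7, 7], [7, 7, 7]]
cutout = [[9, 9], [9, 9]]
mask = [[1, 0], [1, 1]]  # e.g. a mask produced by a segmentation model
new_image, new_labels = paste_cutout(image, labels, cutout, mask, 42, 1, 1)
```

Since only the training data changes, any downstream domain adaptation method can consume the augmented samples without modification.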

Abstract:
Historical maps are valuable resources that capture detailed geographical information from the past. However, these maps are typically available in printed formats, which are not conducive to modern computer‑based analyses. Digitizing these maps into a machine‑readable format enables efficient computational analysis. In this paper, we propose an automated approach to digitization using deep‑learning‑based semantic segmentation, which assigns a semantic label to each pixel in scanned historical maps. A key challenge in this process is the lack of ground‑truth annotations required for training deep neural networks, as manual labeling is time‑consuming and labor‑intensive. To address this issue, we introduce a weakly‑supervised age‑tracing strategy for model fine‑tuning. This approach exploits the similarity in appearance and land‑use patterns between historical maps from neighboring time periods to guide the training process. Specifically, model predictions for one map are utilized as pseudo‑labels for training on maps from adjacent time periods. Experiments conducted on our newly curated Hameln dataset demonstrate that the proposed age‑tracing strategy significantly enhances segmentation performance compared to baseline models. In the best‑case scenario, the mean Intersection over Union (mIoU) achieved 77.3%, reflecting an improvement of approximately 20% over baseline methods. Additionally, the fine‑tuned model achieved an average overall accuracy of 97%, highlighting the effectiveness of our approach for digitizing historical maps.

Abstract:
Reducing computational costs is an important issue for development of embedded systems. Binary‑weight Neural Networks (BNNs), in which weights are binarized and activations are quantized, are employed to reduce computational costs of various kinds of applications. In this paper, a design methodology of hardware architecture for inference engines is proposed to handle modern BNNs with two operation modes. Multiply‑Accumulate (MAC) operations can be simplified by replacing multiply operations with bitwise operations. The proposed method can effectively reduce the gate count of inference engines by removing a part of computational costs from the hardware system. The architecture of MAC operations can calculate the inference results of BNNs efficiently with only 52% of hardware costs compared with the related works. To show that the inference engine can handle practical applications, two lightweight networks which combine the backbones of SegNeXt and the decoder of SparseInst for instance segmentation are also proposed. The output results of the lightweight networks are computed using only bitwise operations and add operations. The proposed inference engine has lower hardware costs than related works. The experimental results show that the proposed inference engine can handle the proposed instance‑segmentation networks and achieves higher accuracy than YOLACT on the "Person" category although the model size is 77.7× smaller compared with YOLACT.
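The claim that multiplications can be replaced with bitwise operations can be illustrated for binary weights in {+1, -1}. The sketch below stores each weight as a sign bit and uses the two's-complement identity -a = (a XOR -1) + 1 to negate an activation without multiplying. The encoding (bit 1 means weight -1) and the function name are assumptions for this example; the paper's hardware design is not described in enough detail to reproduce.

```python
def binary_weight_mac(activations, sign_bits):
    """Multiply-accumulate with weights in {+1, -1}, using no multiplication.

    sign_bits[i] == 0 encodes weight +1, sign_bits[i] == 1 encodes weight -1
    (assumed encoding). Activations are integers (quantized).
    """
    acc = 0
    for a, s in zip(activations, sign_bits):
        # s == 0: (a ^ 0) + 0 == a
        # s == 1: (a ^ -1) + 1 == ~a + 1 == -a  (two's-complement negation)
        acc += (a ^ -s) + s
    return acc

result = binary_weight_mac([3, 5, 2], [0, 1, 0])  # 3 - 5 + 2
```

In hardware, the conditional negation reduces to an XOR array plus a carry-in to the adder, which is far cheaper in gate count than a full multiplier.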

Abstract:
Image segmentation is a vital task for providing human assistance and enhancing autonomy in our daily lives. In particular, RGB‑D segmentation, which leverages both visual and depth cues, has attracted increasing attention as it promises richer scene understanding than RGB‑only methods. However, most existing efforts have primarily focused on semantic segmentation and thus leave a critical gap. There is a relative scarcity of instance‑level RGB‑D segmentation datasets, which restricts current methods to broad category distinctions rather than fully capturing the fine‑grained details required for recognizing individual objects. To bridge this gap, we introduce three RGB‑D instance segmentation benchmarks, distinguished at the instance level. These datasets are versatile, supporting a wide range of applications from indoor navigation to robotic manipulation. In addition, we present an extensive evaluation of various baseline models on these benchmarks. This comprehensive analysis identifies both their strengths and shortcomings, guiding future work toward more robust, generalizable solutions. Finally, we propose a simple yet effective method for RGB‑D data integration. Extensive evaluations affirm the effectiveness of our approach, offering a robust framework for advancing toward more nuanced scene understanding.

Abstract:
Semi‑supervised (SS) semantic segmentation exploits both labeled and unlabeled images to overcome tedious and costly pixel‑level annotation problems. Pseudo‑label supervision is one of the core approaches to training networks with both pseudo labels and ground‑truth labels. This work uses aleatoric (data) uncertainty and energy‑based modeling in an intersection‑union pseudo‑supervised network. The aleatoric uncertainty models the inherent noise variations of the data in a network with two predictive branches. The per‑pixel variance parameter obtained from the network gives a quantitative measure of the data uncertainty. Moreover, the energy‑based loss realizes the potential of generative modeling on the downstream SS segmentation task. The aleatoric and energy losses are applied in conjunction with pseudo‑intersection labels, pseudo‑union labels, and ground truth on the respective network branches. A comparative analysis with state‑of‑the‑art methods shows improvement in performance metrics.

Abstract:
Reconstructing the intricate local morphology of neurons and their long‑range projecting axons can address many connectivity‑related questions in neuroscience. The main bottleneck in connectomics pipelines is correcting topological errors, as separating multiple entangled neuronal arbors is a challenging instance segmentation problem. More broadly, segmentation of curvilinear, filamentous structures continues to pose significant challenges. To address this problem, we extend the notion of simple points from digital topology to connected sets of voxels (i.e., supervoxels) and propose a topology‑aware neural network segmentation method with minimal computational overhead. We demonstrate its effectiveness on a new public dataset of 3D light microscopy images of mouse brains, along with the benchmark datasets DRIVE, ISBI12, and CrackTree.

Abstract:
Open‑vocabulary segmentation aims to identify and segment specific regions and objects based on text‑based descriptions. A common solution is to leverage powerful vision‑language models (VLMs), such as CLIP, to bridge the gap between vision and text information. However, VLMs are typically pretrained for image‑level vision‑text alignment, focusing on global semantic features. In contrast, segmentation tasks require fine‑grained pixel‑level alignment and detailed category boundary information, which VLMs alone cannot provide. As a result, information extracted directly from VLMs cannot meet the requirements of segmentation tasks. To address this limitation, we propose FGAseg, a model designed for fine‑grained pixel‑text alignment and category boundary supplementation. The core of FGAseg is a Pixel‑Level Alignment module that employs a cross‑modal attention mechanism and a text‑pixel alignment loss to refine the coarse‑grained alignment from CLIP, achieving finer‑grained pixel‑text semantic alignment. Additionally, to enrich category boundary information, we introduce the alignment matrices as optimizable pseudo‑masks during forward propagation and propose a Category Information Supplementation module. These pseudo‑masks, derived from cosine and convolutional similarity, provide essential global and local boundary information between different categories. By combining these two strategies, FGAseg effectively enhances pixel‑level alignment and category boundary information, addressing key challenges in open‑vocabulary segmentation. Extensive experiments demonstrate that FGAseg outperforms existing methods on open‑vocabulary semantic segmentation benchmarks.

Abstract:
Surgical video segmentation is critical for AI to interpret spatial‑temporal dynamics in surgery, yet model performance is constrained by limited annotated data. The SAM2 model, pretrained on natural videos, offers potential for zero‑shot surgical segmentation, but its applicability in complex surgical environments, with challenges like tissue deformation and instrument variability, remains unexplored. We present the first comprehensive evaluation of the zero‑shot capability of SAM2 in 9 surgical datasets (17 surgery types), covering laparoscopic, endoscopic, and robotic procedures. We analyze various prompting (points, boxes, mask) and finetuning (dense, sparse) strategies, robustness to surgical challenges, and generalization across procedures and anatomies. Key findings reveal that while SAM2 demonstrates notable zero‑shot adaptability in structured scenarios (e.g., instrument segmentation, multi‑organ segmentation, and scene segmentation), its performance varies under dynamic surgical conditions, highlighting gaps in handling temporal coherence and domain‑specific artifacts. These results highlight future pathways to adaptive data‑efficient solutions for the surgical data science field.

Abstract:
The success rate of catheterization procedures is closely linked to the sensory data provided to the surgeon. Vision‑based deep learning models can deliver both tactile and visual information in a sensor‑free manner, while also being cost‑effective to produce. Given the complexity of these models for devices with limited computational resources, research has focused on force estimation and catheter segmentation separately. However, there is a lack of a comprehensive architecture capable of simultaneously segmenting the catheter from two different angles and estimating the applied forces in 3D. To bridge this gap, this work proposes a novel, lightweight, multi‑input, multi‑output encoder‑decoder‑based architecture. It is designed to segment the catheter from two points of view and concurrently measure the applied forces in the x, y, and z directions. This network processes two simultaneous X‑ray images, intended to be fed by a biplane fluoroscopy system, showing a catheter's deflection from different angles. It uses two parallel sub‑networks with shared parameters to output two segmentation maps corresponding to the inputs. Additionally, it leverages stereo vision to estimate the applied forces at the catheter's tip in 3D. The architecture features two input channels, two classification heads for segmentation, and a regression head for force estimation through a single end‑to‑end architecture. The outputs of all heads were assessed and compared with the literature, demonstrating state‑of‑the‑art performance in both segmentation and force estimation. To the best of the authors' knowledge, this is the first time such a model has been proposed.

Abstract:
Instance segmentation performance in remote sensing images (RSIs) is significantly affected by two issues: how to extract accurate boundaries of objects from remote imaging through the dynamic atmosphere, and how to integrate the mutual information of related object instances scattered over a vast spatial region. In this study, we propose a novel Shape Guided Transformer Network (SGTN) to accurately extract objects at the instance level. Inspired by the global contextual modeling capacity of the self‑attention mechanism, we propose an effective transformer encoder termed LSwin, which incorporates vertical and horizontal 1D global self‑attention mechanisms to obtain better global‑perception capacity for RSIs than the popular local‑shifted‑window based Swin Transformer. To achieve accurate instance mask segmentation, we introduce a shape guidance module (SGM) to emphasize the object boundary and shape information. The combination of SGM, which emphasizes local detail information, and LSwin, which focuses on global context relationships, achieves excellent RSI instance segmentation. Their effectiveness was validated through comprehensive ablation experiments. In particular, LSwin proves better than the popular ResNet and Swin Transformer encoders at the same level of efficiency. Compared to other instance segmentation methods, our SGTN achieves the highest average precision (AP) scores on two single‑class public datasets (WHU dataset and BITCC dataset) and a multi‑class public dataset (NWPU VHR‑10 dataset). Code will be available at http://gpcv.whu.edu.cn/data/.

Abstract:
Panoptic segmentation, which combines instance and semantic segmentation, has gained a lot of attention in autonomous vehicles due to its comprehensive representation of the scene. This task can be applied to cameras and LiDAR sensors, but there has been limited focus on combining both sensors to enhance image panoptic segmentation (PS). Although previous research has acknowledged the benefit of 3D data for camera‑based scene perception, no specific study has explored the influence of 3D data on image and video panoptic segmentation (VPS). This work introduces a feature fusion module that enhances PS and VPS by fusing LiDAR and image data for autonomous vehicles. We also show that, in addition to this fusion, our proposed model, which uses two simple modifications, can deliver even higher‑quality VPS without being trained on video data. The results demonstrate a substantial improvement in both the image and video panoptic segmentation evaluation metrics by up to 5 points.

Abstract:
Weakly‑supervised semantic segmentation (WSSS) has achieved remarkable progress using only image‑level labels. However, most existing WSSS methods focus on designing new network structures and loss functions to generate more accurate dense labels, overlooking the limitations imposed by fixed datasets, which can constrain performance improvements. We argue that more diverse training images provide WSSS with richer information and help the model learn more comprehensive semantic patterns. Therefore, in this paper, we introduce a novel approach called Image Augmentation Agent (IAA), which shows that it is possible to enhance WSSS from a data generation perspective. IAA is built around an augmentation agent that leverages large language models (LLMs) and diffusion models to automatically generate additional images for WSSS. In practice, to address the instability of prompt generation by LLMs, we develop a prompt self‑refinement mechanism. It allows LLMs to re‑evaluate the rationality of generated prompts to produce more coherent prompts. Additionally, we insert an online filter into the diffusion generation process to dynamically ensure the quality and balance of generated images. Experimental results show that our method significantly surpasses state‑of‑the‑art WSSS approaches on the PASCAL VOC 2012 and MS COCO 2014 datasets.

Abstract:
Large‑scale video generation models have the inherent ability to realistically model natural scenes. In this paper, we demonstrate that through a careful design of a generative video propagation framework, various video tasks can be addressed in a unified way by leveraging the generative power of such models. Specifically, our framework, GenProp, encodes the original video with a selective content encoder and propagates the changes made to the first frame using an image‑to‑video generation model. We propose a data generation scheme to cover multiple video tasks based on instance‑level video segmentation datasets. Our model is trained by incorporating a mask prediction decoder head and optimizing a region‑aware loss to aid the encoder to preserve the original content while the generation model propagates the modified region. This novel design opens up new possibilities: In editing scenarios, GenProp allows substantial changes to an object's shape; for insertion, the inserted objects can exhibit independent motion; for removal, GenProp effectively removes effects like shadows and reflections from the whole video; for tracking, GenProp is capable of tracking objects and their associated effects together. Experiment results demonstrate the leading performance of our model in various video tasks, and we further provide in‑depth analyses of the proposed framework.

Abstract:
The application of Contrastive Language‑Image Pre‑training (CLIP) in Weakly Supervised Semantic Segmentation (WSSS) research demonstrates powerful cross‑modal semantic understanding capabilities. Existing methods attempt to optimize input text prompts for improved alignment of images and text, by finely adjusting text prototypes to facilitate semantic matching. Nevertheless, given the modality gap between text and vision spaces, the text prototypes employed by these methods have not effectively established a close correspondence with pixel‑level vision features. In this work, our theoretical analysis indicates that the inherent modality gap results in misalignment of text and region features, and that this gap cannot be sufficiently reduced by minimizing the contrastive loss in CLIP. To mitigate the impact of the modality gap, we propose a Vision Prototype Learning (VPL) framework that introduces more representative vision prototypes. The core of this framework is to learn class‑specific vision prototypes in vision space with the help of text prototypes, for capturing high‑quality localization maps. Moreover, we propose a regional semantic contrast module that contrasts region embeddings with their corresponding prototypes, leading to more comprehensive and robust feature learning. Experimental results show that our proposed framework achieves state‑of‑the‑art performance on two benchmark datasets.

Abstract:
In this article, we present the Layered Semantic Graphs (LSG), a novel actionable hierarchical scene graph, fully integrated with a multi‑modal mission planner, the FLIE: A First‑Look based Inspection and Exploration planner. The novelty of this work stems from addressing the task of maintaining an intuitive and multi‑resolution scene representation, while simultaneously offering a tractable foundation for planning and scene understanding during an ongoing inspection mission of a priori unknown targets‑of‑interest in an unknown environment. The proposed LSG scheme is composed of locally nested hierarchical graphs, at multiple layers of abstraction, with the abstract concepts grounded on the functionality of the integrated FLIE planner. Furthermore, LSG encapsulates real‑time semantic segmentation models that offer extraction and localization of desired semantic elements within the hierarchical representation. This extends the capability of the inspection planner, which can then leverage LSG to make an informed decision to inspect a particular semantic of interest. We also emphasize the hierarchical and semantic path‑planning capabilities of LSG, which could extend inspection missions by improving situational awareness for human operators in an unknown environment. The validity of the proposed scheme is demonstrated through extensive evaluations of the proposed architecture in simulations, as well as experimental field deployments on a Boston Dynamics Spot quadruped robot in urban outdoor environment settings.

Abstract:
Deep learning (DL)‑based point cloud segmentation is essential for understanding the built environment. Although synthetic point clouds (SPC) have the potential to compensate for data shortage, how synthetic color and mixing proportion impact DL‑based segmentation remains a long‑standing question. This paper addresses the question with extensive experiments by introducing: 1) a method to generate SPC with real colors and uniform colors from BIM, and 2) enhanced benchmarks for better performance evaluation. Experiments on DL models including PointNet, PointNet++, and DGCNN show that model performance on SPC with real colors exceeds that on SPC with uniform colors by more than 8.2% on both OA and mIoU. Furthermore, a mixing proportion of SPC higher than 70% usually leads to better performance. Moreover, SPC can replace real point clouds to train a DL model for detecting large and flat building elements. Overall, this paper unveils the performance‑improving mechanism of SPC and brings new insights to boost SPC's value (for building large models for point clouds).

Abstract:
Recently, Visual Foundation Models (VFMs) have shown a remarkable generalization performance in 3D perception tasks. However, their effectiveness in large‑scale outdoor datasets remains constrained by the scarcity of accurate supervision signals, the extensive noise caused by variable outdoor conditions, and the abundance of unknown objects. In this work, we propose a novel label‑free learning method, Adaptive Label Correction (AdaCo), for 3D semantic segmentation. AdaCo first introduces the Cross‑modal Label Generation Module (CLGM), providing cross‑modal supervision with the formidable interpretive capabilities of the VFMs. Subsequently, AdaCo incorporates the Adaptive Noise Corrector (ANC), updating and adjusting the noisy samples within this supervision iteratively during training. Moreover, we develop an Adaptive Robust Loss (ARL) function to modulate each sample's sensitivity to noisy supervision, preventing potential underfitting issues associated with robust loss. Our proposed AdaCo can effectively mitigate the performance limitations of label‑free learning networks in 3D semantic segmentation tasks. Extensive experiments on two outdoor benchmark datasets highlight the superior performance of our method.

Abstract:
Open‑world 3D scene understanding is a critical challenge that involves recognizing and distinguishing diverse objects and categories from 3D data, such as point clouds, without relying on manual annotations. Traditional methods struggle with this open‑world task, especially due to the limitations of constructing extensive point cloud‑text pairs and handling multimodal data effectively. In response to these challenges, we present UniPLV, a robust framework that unifies point clouds, images, and text within a single learning paradigm for comprehensive 3D scene understanding. UniPLV leverages images as a bridge to co‑embed 3D points with pre‑aligned images and text in a shared feature space, eliminating the need for labor‑intensive point cloud‑text pair crafting. Our framework achieves precise multimodal alignment through two innovative strategies: (i) Logit and feature distillation modules between images and point clouds to enhance feature coherence; (ii) A vision‑point matching module that implicitly corrects 3D semantic predictions affected by projection inaccuracies from points to pixels. To further boost performance, we implement four task‑specific losses alongside a two‑stage training strategy. Extensive experiments demonstrate that UniPLV significantly surpasses state‑of‑the‑art methods, with average improvements of 15.6% and 14.8% in semantic segmentation for Base‑Annotated and Annotation‑Free tasks, respectively. These results underscore UniPLV's efficacy in pushing the boundaries of open‑world 3D scene understanding. We will release the code to support future research and development.

Abstract:
Applying Gaussian Splatting to perception tasks for 3D scene understanding is becoming increasingly popular. Most existing works primarily focus on rendering 2D feature maps from novel viewpoints, which leads to an imprecise 3D language field with outlier languages, ultimately failing to align objects in 3D space. By utilizing masked images for feature extraction, these approaches also lack essential contextual information, leading to inaccurate feature representation. To this end, we propose a Language‑Embedded Surface Field (LangSurf), which accurately aligns the 3D language fields with the surface of objects, facilitating precise 2D and 3D segmentation with text query, widely expanding the downstream tasks such as removal and editing. The core of LangSurf is a joint training strategy that flattens the language Gaussian on the object surfaces using geometry supervision and contrastive losses to assign accurate language features to the Gaussians of objects. In addition, we also introduce the Hierarchical‑Context Awareness Module to extract features at the image level for contextual information then perform hierarchical mask pooling using masks segmented by SAM to obtain fine‑grained language features in different hierarchies. Extensive experiments on open‑vocabulary 2D and 3D semantic segmentation demonstrate that LangSurf outperforms the previous state‑of‑the‑art method LangSplat by a large margin. As shown in Fig. 1, our method is capable of segmenting objects in 3D space, thus boosting the effectiveness of our approach in instance recognition, removal, and editing, which is also supported by comprehensive experiments. https://langsurf.github.io.

Abstract:
Deep neural networks have shown outstanding performance in computer vision tasks such as semantic segmentation and have defined the state‑of‑the‑art. However, these segmentation models are trained on a closed and predefined set of semantic classes, which leads to significant prediction failures in open‑world scenarios on unknown objects. As this behavior prevents the application in safety‑critical applications such as automated driving, the detection and segmentation of these objects from outside their predefined semantic space (out‑of‑distribution (OOD) objects) is of the utmost importance. In this work, we present a multi‑scale OOD segmentation method that exploits the confidence information of a foreground‑background segmentation model. While semantic segmentation models are trained on specific classes, this restriction does not apply to foreground‑background methods making them suitable for OOD segmentation. We consider the per pixel confidence score of the model prediction which is close to 1 for a pixel in a foreground object. By aggregating these confidence values for different sized patches, objects of various sizes can be identified in a single image. Our experiments show improved performance of our method in OOD segmentation compared to comparable baselines in the SegmentMeIfYouCan benchmark.
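
The patch aggregation described in the abstract above can be illustrated with a minimal sketch. It assumes a per‑pixel foreground confidence map is already available as a 2D array; the function names, the patch sizes, and the choice of a per‑pixel maximum over scales are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def patch_mean(conf, k):
    """Mean of conf over every k x k window (stride 1, edge padding),
    computed with an integral image."""
    pad = k // 2
    padded = np.pad(conf, ((pad, k - 1 - pad), (pad, k - 1 - pad)), mode="edge")
    # Prepend a zero row and column so window sums are simple differences.
    integral = np.pad(padded, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)
    h, w = conf.shape
    total = (integral[k:k + h, k:k + w] - integral[:h, k:k + w]
             - integral[k:k + h, :w] + integral[:h, :w])
    return total / (k * k)

def multiscale_score(conf, sizes=(1, 3, 5)):
    """Hypothetical aggregation: per-pixel maximum of patch means over scales."""
    return np.max(np.stack([patch_mean(conf, k) for k in sizes]), axis=0)

# Toy confidence map with a single confident pixel in the centre.
conf = np.zeros((5, 5))
conf[2, 2] = 1.0
score = multiscale_score(conf)
```

Small patches keep sharp responses for small objects, while larger patches spread evidence for large objects; other aggregation rules (for example a weighted sum) would fit the same description.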

Abstract:
In this paper, we address the challenging task of modality‑agnostic semantic segmentation (MaSS), aiming at centering the value of every modality at every feature granularity. Training with all available visual modalities and effectively fusing an arbitrary combination of them is essential for robust multi‑modal fusion in semantic segmentation, especially in real‑world scenarios, yet remains less explored to date. Existing approaches often place RGB at the center, treating other modalities as secondary, resulting in an asymmetric architecture. However, RGB alone can be limiting in scenarios like nighttime, where modalities such as event data excel. Therefore, a resilient fusion model must dynamically adapt to each modality's strengths while compensating for weaker inputs. To this end, we introduce the MAGIC++ framework, which comprises two key plug‑and‑play modules for effective multi‑modal fusion and hierarchical modality selection that can be equipped with various backbone models. Firstly, we introduce a multi‑modal interaction module to efficiently process features from the input multi‑modal batches and extract complementary scene information with channel‑wise and spatial‑wise guidance. On top of this, a unified multi‑scale arbitrary‑modal selection module is proposed to use the aggregated features as the benchmark to rank the multi‑modal features based on the similarity scores at hierarchical feature spaces. In this way, our method can eliminate the dependence on the RGB modality at every feature granularity and better overcome sensor failures and environmental noise while ensuring segmentation performance. Under the common multi‑modal setting, our method achieves state‑of‑the‑art performance on both real‑world and synthetic benchmarks. Moreover, our method is superior in the novel modality‑agnostic setting, where it outperforms prior art by a large margin.

Abstract:
Semantic segmentation requires extensive pixel‑level annotation, motivating unsupervised domain adaptation (UDA) to transfer knowledge from labelled source domains to unlabelled or weakly labelled target domains. One of the most efficient strategies involves using synthetic datasets generated within controlled virtual environments, such as video games or traffic simulators, which can automatically generate pixel‑level annotations. However, even when such datasets are available, learning a well‑generalised representation that captures both domains remains challenging, owing to probabilistic and geometric discrepancies between the virtual world and real‑world imagery. This work introduces a semantic segmentation method based on latent diffusion models, termed Inter‑Coder Connected Latent Diffusion (ICCLD), alongside an unsupervised domain adaptation approach. The model employs an inter‑coder connection to enhance contextual understanding and preserve fine details, while adversarial learning aligns latent feature distributions across domains during the latent diffusion process. Experiments on GTA5, Synthia, and Cityscapes demonstrate that ICCLD outperforms state‑of‑the‑art UDA methods, achieving mIoU scores of 74.4 (GTA5 → Cityscapes) and 67.2 (Synthia → Cityscapes).

Abstract:
Current agriculture and farming industries are able to reap advancements in robotics and automation technology to harvest fruits and vegetables using robots with adaptive grasping forces based on the compliance or softness of the fruit or vegetable. A successful operation depends on using a gripper that can adapt to the mechanical properties of the crops. This paper proposes a new robotic harvesting approach for tomato fruit using a novel hybrid gripper with a soft caging effect. It uses its six flexible passive auxetic structures based on fingers with rigid outer exoskeletons for good gripping strength and shape conformability. The gripper is actuated through a scotch‑yoke mechanism using a servo motor. To perform tomato picking operations through a gripper, a vision system based on a depth camera and RGB camera implements the fruit identification process. It incorporates deep learning‑based keypoint detection of the tomato's pedicel and body for localization in an occluded and variable ambient light environment and semantic segmentation of ripe and unripe tomatoes. In addition, robust trajectory planning of the robotic arm based on input from the vision system and control of robotic gripper movements are carried out for secure tomato handling. The tunable grasping force of the gripper would allow the robotic handling of fruits with a broad range of compliance.

Abstract:
Existing infrared and visible (IR‑VIS) methods inherit the general representations of Pre‑trained Visual Models (PVMs) to facilitate complementary learning. However, our analysis indicates that under the full fine‑tuning paradigm, the feature space becomes highly constrained and low‑ranked, which has been proven to seriously impair generalization. One remedy is to freeze the parameters, which preserves pretrained knowledge and helps maintain feature diversity. To this end, we propose IV‑tuning, to parameter‑efficiently harness PVMs for various IR‑VIS downstream tasks, including salient object detection, semantic segmentation, and object detection. Extensive experiments across various settings demonstrate that IV‑tuning outperforms previous state‑of‑the‑art methods, and exhibits superior generalization and scalability. Remarkably, with only a single backbone, IV‑tuning effectively facilitates the complementary learning of infrared and visible modalities with merely 3% trainable backbone parameters, and achieves superior computational efficiency compared to conventional IR‑VIS paradigms.

Abstract:
Vision transformers dominate image processing tasks due to their superior performance. However, the quadratic complexity of self‑attention limits the scalability of these systems and their deployment on resource‑constrained devices. State Space Models (SSMs) have emerged as a solution by introducing a linear recurrence mechanism, which reduces the complexity of sequence modeling from quadratic to linear. Recently, SSMs have been extended to high‑resolution vision tasks. Nonetheless, the linear recurrence mechanism struggles to fully utilize matrix multiplication units on modern hardware, resulting in a computational bottleneck. We address this issue by introducing VMeanba, a training‑free compression method that eliminates the channel dimension in SSMs using mean operations. Our key observation is that the output activations of SSM blocks exhibit low variances across channels. Our VMeanba leverages this property to optimize computation by averaging activation maps across the channel to reduce the computational overhead without compromising accuracy. Evaluations on image classification and semantic segmentation tasks demonstrate that VMeanba achieves up to a 1.12x speedup with less than a 3% accuracy loss. When combined with 40% unstructured pruning, the accuracy drop remains under 3%.
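
The key observation in the abstract above (low variance of activations across channels) can be checked with a small sketch. It only illustrates how close a channel‑mean map is to the full activations; it is not the VMeanba method itself, and the function name and array layout (channels, length) are assumptions made for illustration.

```python
import numpy as np

def channel_mean_approx(acts):
    """acts: array of shape (channels, length).
    Returns the channel-mean map of shape (1, length) and the relative
    error incurred when every channel is replaced by that mean."""
    mean_map = acts.mean(axis=0, keepdims=True)
    approx = np.broadcast_to(mean_map, acts.shape)
    rel_err = np.linalg.norm(acts - approx) / np.linalg.norm(acts)
    return mean_map, rel_err

# Activations that barely differ across channels: the mean is a close proxy.
rng = np.random.default_rng(0)
base = np.linspace(1.0, 2.0, 16)
acts = base + 0.01 * rng.standard_normal((8, 16))
mean_map, rel_err = channel_mean_approx(acts)
```

When the relative error is small, operations that follow can in principle work on one averaged map instead of every channel, which is the intuition the abstract states.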

Abstract:
This paper introduces a novel synthetic dataset that captures urban scenes under a variety of weather conditions, providing pixel‑perfect, ground‑truth‑aligned images to facilitate effective feature alignment across domains. Additionally, we propose a method for domain adaptation and generalization that takes advantage of the multiple versions of each scene, enforcing feature consistency across different weather scenarios. Our experimental results demonstrate the impact of our dataset in improving performance across several alignment metrics, addressing key challenges in domain adaptation and generalization for segmentation tasks. This research also explores critical aspects of synthetic data generation, such as optimizing the balance between the volume and variability of generated images to enhance segmentation performance. Ultimately, this work sets forth a new paradigm for synthetic data generation and domain adaptation.

Abstract:
Self‑supervised visual foundation models produce powerful embeddings that achieve remarkable performance on a wide range of downstream tasks. However, unlike vision‑language models such as CLIP, self‑supervised visual features are not readily aligned with language, hindering their adoption in open‑vocabulary tasks. Our method, named dino.txt, unlocks this new ability for DINOv2, a widely used self‑supervised visual encoder. We build upon the LiT training strategy, which trains a text encoder to align with a frozen vision model but leads to unsatisfactory results on dense tasks. We propose several key ingredients to improve performance on both global and dense tasks, such as concatenating the [CLS] token with the patch average to train the alignment and curating data using both text and image modalities. With these, we successfully train a CLIP‑like model with only a fraction of the computational cost compared to CLIP while achieving state‑of‑the‑art results in zero‑shot classification and open‑vocabulary semantic segmentation.
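
One ingredient named in the abstract above, concatenating the [CLS] token with the patch average, can be sketched as follows. The sketch assumes the vision encoder returns a token matrix with the [CLS] token in the first row; the function name and shapes are illustrative and not taken from the dino.txt code.

```python
import numpy as np

def global_image_embedding(tokens):
    """tokens: array of shape (1 + num_patches, dim), [CLS] token at row 0.
    Returns a vector of length 2 * dim: [CLS] followed by the mean patch token."""
    cls_token = tokens[0]
    patch_avg = tokens[1:].mean(axis=0)
    return np.concatenate([cls_token, patch_avg])

# Toy example: one [CLS] token and three patch tokens of dimension 3.
tokens = np.arange(12, dtype=float).reshape(4, 3)
embedding = global_image_embedding(tokens)
```

The resulting vector is twice the token dimension, so a text encoder aligned to it would need a matching output width.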

Abstract:
The advances in the development of Facilitative Playbacks extracted from High‑Speed videoendoscopic sequences of the vocal folds are hindered by a notable lack of publicly available datasets annotated with the semantic segmentations corresponding to the area of the glottal gap. This fact also limits the reproducibility and further exploration of existing research in this field. To address this gap, GIRAFE is a data repository designed to facilitate the development of advanced techniques for the semantic segmentation, analysis, and fast evaluation of High‑Speed videoendoscopic sequences of the vocal folds. The repository includes 65 high‑speed videoendoscopic recordings from a cohort of 50 patients (30 female, 20 male). The dataset comprises 15 recordings from healthy controls, 26 from patients with diagnosed voice disorders, and 24 with an unknown health condition. All of them were manually annotated by an expert, including the masks corresponding to the semantic segmentation of the glottal gap. The repository is also complemented with the automatic segmentation of the glottal area using different state‑of‑the‑art approaches. This data set has already supported several studies, which demonstrates its usefulness for the development of new glottal gap segmentation algorithms from High‑Speed‑Videoendoscopic sequences to improve or create new Facilitative Playbacks. Despite these advances and others in the field, the broader challenge of performing an accurate and completely automatic semantic segmentation method of the glottal area remains open.

Abstract:
We present the approaches and contributions of the winning team NimbRo@Home at the RoboCup@Home 2024 competition in the Open Platform League held in Eindhoven, NL. Further, we describe our hardware setup and give an overview of the results for the task stages and the final demonstration. For this year's competition, we put a special emphasis on open‑vocabulary object segmentation and grasping approaches that overcome the labeling overhead of supervised vision approaches, commonly used in RoboCup@Home. We successfully demonstrated that we can segment and grasp non‑labeled objects by text descriptions. Further, we extensively employed LLMs for natural language understanding and task planning. Throughout the competition, our approaches showed robustness and generalization capabilities. A video of our performance can be found online.

Abstract:
Due to its efficiency, Post‑Training Quantization (PTQ) has been widely adopted for compressing Vision Transformers (ViTs). However, when quantized into low‑bit representations, there is often a significant performance drop compared to their full‑precision counterparts. To address this issue, reconstruction methods have been incorporated into the PTQ framework to improve performance in low‑bit quantization settings. Nevertheless, existing related methods predefine the reconstruction granularity and seldom explore the progressive relationships between different reconstruction granularities, which leads to sub‑optimal quantization results in ViTs. To this end, in this paper, we propose a Progressive Fine‑to‑Coarse Reconstruction (PFCR) method for accurate PTQ, which significantly improves the performance of low‑bit quantized vision transformers. Specifically, we define multi‑head self‑attention and multi‑layer perceptron modules along with their shortcuts as the finest reconstruction units. After reconstructing these two fine‑grained units, we combine them to form coarser blocks and reconstruct them at a coarser granularity level. We iteratively perform this combination and reconstruction process, achieving progressive fine‑to‑coarse reconstruction. Additionally, we introduce a Progressive Optimization Strategy (POS) for PFCR to alleviate the difficulty of training, thereby further enhancing model performance. Experimental results on the ImageNet dataset demonstrate that our proposed method achieves the best Top‑1 accuracy among state‑of‑the‑art methods, particularly attaining 75.61% for 3‑bit quantized ViT‑B in PTQ. Besides, quantization results on the COCO dataset reveal the effectiveness and generalization of our proposed method on other computer vision tasks like object detection and instance segmentation.

Abstract:
Spiking Neural Networks (SNNs) have a low‑power advantage but perform poorly in image segmentation tasks. The reason is that directly converting neural networks with complex architectural designs for segmentation tasks into spiking versions leads to performance degradation and non‑convergence. To address this challenge, we first identify the modules in the architecture design that lead to the severe reduction in spike firing, make targeted improvements, and propose the Spike2Former architecture. Second, we propose normalized integer spiking neurons to solve the training stability problem of SNNs with complex architectures. We set a new state‑of‑the‑art for SNNs on various semantic segmentation datasets, with a significant improvement of +12.7% mIoU and 5.0× efficiency on ADE20K, +14.3% mIoU and 5.2× efficiency on VOC2012, and +9.1% mIoU and 6.6× efficiency on CityScapes.

Abstract:
In this paper, we propose a novel approach to minimize the inference delay in semantic segmentation using split learning (SL), tailored to the needs of real‑time computer vision (CV) applications for resource‑constrained devices. Semantic segmentation is essential for applications such as autonomous vehicles and smart city infrastructure, but faces significant latency challenges due to high computational and communication loads. Traditional centralized processing methods are inefficient for such scenarios, often resulting in unacceptable inference delays. SL offers a promising alternative by partitioning deep neural networks (DNNs) between edge devices and a central server, enabling localized data processing and reducing the amount of data required for transmission. Our contribution includes the joint optimization of bandwidth allocation, cut layer selection of the edge devices' DNN, and the central server's processing resource allocation. We investigate both parallel and serial data processing scenarios and propose low‑complexity heuristic solutions that maintain near‑optimal performance while reducing computational requirements. Numerical results show that our approach effectively reduces inference delay, demonstrating the potential of SL for improving real‑time CV applications in dynamic, resource‑constrained environments.
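
The cut layer selection mentioned in the abstract above can be illustrated for a single device with a simple additive delay model. Everything in this sketch is an assumption made for illustration (the function name, the per‑layer timings, and the model delay = device compute + transmission + server compute); the paper's joint optimization over bandwidth and server resources is not reproduced here.

```python
def best_cut_layer(device_time, server_time, out_bits, input_bits, bandwidth_bps):
    """Exhaustively pick the cut layer that minimizes end-to-end delay.

    cut = c means layers [0, c) run on the device and [c, n) on the server.
    device_time[i] / server_time[i]: seconds to run layer i on each side.
    out_bits[i]: size of layer i's output; input_bits: size of the raw input.
    """
    n = len(device_time)
    best = None
    for cut in range(n + 1):
        sent = input_bits if cut == 0 else out_bits[cut - 1]
        # Nothing is transmitted when the whole network runs on the device.
        tx = 0.0 if cut == n else sent / bandwidth_bps
        delay = sum(device_time[:cut]) + tx + sum(server_time[cut:])
        if best is None or delay < best[1]:
            best = (cut, delay)
    return best

cut, delay = best_cut_layer(
    device_time=[4, 4, 4], server_time=[1, 1, 1],
    out_bits=[100, 10, 10], input_bits=1000, bandwidth_bps=10,
)
```

In this toy setting the best choice is to run two layers on the device: sending the raw input is too expensive, while running everything locally wastes the faster server.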

Abstract:
Recent advancements in Vision Transformers (ViT) have demonstrated exceptional results in various visual recognition tasks, owing to their ability to capture long‑range dependencies in images through self‑attention mechanisms. However, the complex nature of ViT models requires robust explainability methods to unveil their decision‑making processes. Explainable Artificial Intelligence (XAI) plays a crucial role in improving model transparency and trustworthiness by providing insights into model predictions. Current approaches to ViT explainability, based on visualization techniques such as Layer‑wise Relevance Propagation (LRP) and gradient‑based methods, have shown promising but sometimes limited results. In this study, we explore a hybrid approach that mixes multiple explainability techniques to overcome these limitations and enhance the interpretability of ViT models. Our experiments reveal that this hybrid approach significantly improves the interpretability of ViT models compared to individual methods. We also introduce modifications to existing techniques, such as using geometric mean for mixing, which demonstrates notable results in object segmentation tasks. To quantify the explainability gain, we introduced a novel post‑hoc explainability measure by applying the Pigeonhole principle. These findings underscore the importance of refining and optimizing explainability methods for ViT models, paving the way to reliable XAI‑based segmentations.
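
Mixing explainability maps with a geometric mean, as mentioned in the abstract above, can be sketched as follows. The sketch assumes each method yields a non‑negative 2D relevance map; the min‑max normalization, the epsilon, and the 0.5 threshold are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def normalize(relevance, eps=1e-8):
    """Min-max scale a relevance map to roughly [0, 1]."""
    shifted = relevance - relevance.min()
    return shifted / (shifted.max() + eps)

def geometric_mean_mix(maps, eps=1e-8):
    """Pixel-wise geometric mean of several normalized relevance maps."""
    stacked = np.stack([normalize(np.asarray(m, dtype=float)) for m in maps])
    # exp(mean(log(x))) equals the geometric mean; eps avoids log(0).
    return np.exp(np.log(stacked + eps).mean(axis=0))

# Two toy maps (e.g. one from LRP, one gradient-based) that partly disagree.
map_a = np.array([[0.0, 1.0], [0.5, 1.0]])
map_b = np.array([[0.0, 1.0], [1.0, 0.0]])
mixed = geometric_mean_mix([map_a, map_b])
mask = mixed > 0.5
```

Unlike an arithmetic mean, the geometric mean stays near zero wherever any single method assigns near‑zero relevance, so only regions supported by all methods survive thresholding.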

Abstract:
Visual understanding is often approached at three levels of granularity: image, patch, and pixel. Visual tokenization, trained by self‑supervised reconstructive learning, compresses visual data with a patch‑level codebook at marginal information loss, but the resulting visual tokens carry no semantic meaning. Open‑vocabulary semantic segmentation benefits from evolving Vision‑Language Models (VLMs) with strong image‑level zero‑shot capability, but transferring image‑level understanding to the pixel level remains a pressing challenge. In this paper, we treat segmentation as tokenizing pixels and study a unified perceptual and semantic token compression for understanding at all levels of granularity, which in turn facilitates open‑vocabulary semantic segmentation. Following the cognitive process of a pretrained VLM, in which low‑level features are progressively composed into high‑level semantics, we propose Feature Pyramid Tokenization (PAT) to cluster and represent multi‑resolution features with learnable codebooks and then decode them by jointly learning pixel reconstruction and semantic segmentation. We design loosely coupled pixel and semantic learning branches. The pixel branch simulates bottom‑up composition and top‑down visualization of codebook tokens, while the semantic branch collectively fuses hierarchical codebooks as auxiliary segmentation guidance. Our experiments show that PAT enhances the semantic intuition of the VLM feature pyramid, improves performance over the baseline segmentation model, and achieves competitive performance on open‑vocabulary semantic segmentation benchmarks. Our model is parameter‑efficient for VLM integration and flexible for independent tokenization. We hope to inspire work not only on improving segmentation but also on the use of semantic visual tokens.

Abstract:
Weakly Supervised Semantic Segmentation (WSSS), which leverages image‑level labels, has garnered significant attention due to its cost‑effectiveness. The previous methods mainly strengthen the inter‑class differences to avoid class semantic ambiguity which may lead to erroneous activation. However, they overlook the positive function of some shared information between similar classes. Categories within the same cluster share some similar features. Allowing the model to recognize these features can further relieve the semantic ambiguity between these classes. To effectively identify and utilize this shared information, in this paper, we introduce a novel WSSS framework called Prompt Categories Clustering (PCC). Specifically, we explore the ability of Large Language Models (LLMs) to derive category clusters through prompts. These clusters effectively represent the intrinsic relationships between categories. By integrating this relational information into the training network, our model is able to better learn the hidden connections between categories. Experimental results demonstrate the effectiveness of our approach, showing its ability to enhance performance on the PASCAL VOC 2012 dataset and surpass existing state‑of‑the‑art methods in WSSS.

Abstract:
Federated learning (FL) commonly assumes that the server or some clients have labeled data, which is often impractical due to annotation costs and privacy concerns. Addressing this problem, we focus on a source‑free domain adaptation task, where (1) the server holds a pre‑trained model on labeled source domain data, (2) clients possess only unlabeled data from various target domains, and (3) the server and clients cannot access the source data in the adaptation phase. This task is known as Federated source‑Free Domain Adaptation (FFREEDA). Specifically, we focus on classification tasks, while the previous work solely studies semantic segmentation. Our contribution is the novel Federated learning with Weighted Cluster Aggregation (FedWCA) method, designed to mitigate both domain shifts and privacy concerns with only unlabeled data. FedWCA comprises three phases: private and parameter‑free clustering of clients to obtain domain‑specific global models on the server, weighted aggregation of the global models for the clustered clients, and local domain adaptation with pseudo‑labeling. Experimental results show that FedWCA surpasses several existing methods and baselines in FFREEDA, establishing its effectiveness and practicality.

Abstract:
'A trustworthy representation of uncertainty is desirable and should be considered as a key feature of any machine learning method' (Huellermeier and Waegeman, 2021). This conclusion of Huellermeier et al. underpins the importance of calibrated uncertainties. Since AI‑based algorithms are heavily impacted by dataset shifts, the automotive industry needs to safeguard its system against all possible contingencies. One important but often neglected dataset shift is caused by optical aberrations induced by the windshield. For the verification of the perception system performance, requirements on the AI performance need to be translated into optical metrics by a bijective mapping. Given this bijective mapping it is evident that the optical system characteristics add additional information about the magnitude of the dataset shift. As a consequence, we propose to incorporate a physical inductive bias into the neural network calibration architecture to enhance the robustness and the trustworthiness of the AI target application, which we demonstrate by using a semantic segmentation task as an example. By utilizing the Zernike coefficient vector of the optical system as a physical prior we can significantly reduce the mean expected calibration error in case of optical aberrations. As a result, we pave the way for a trustworthy uncertainty representation and for a holistic verification strategy of the perception chain.
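
For reference, the expected calibration error (ECE) that the abstract reports can be sketched as follows. This is the commonly used binned estimator, assuming equal-width confidence bins and per-prediction correctness flags; the function name and the default bin count are illustrative, and the sketch does not model the Zernike-coefficient prior proposed in the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean gap between confidence and accuracy."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += len(members) / total * abs(avg_conf - accuracy)
    return ece


# Four predictions at 95% confidence, three of them correct.
overconfident = expected_calibration_error([0.95] * 4, [1, 1, 1, 0])
```

In a segmentation setting, each pixel would contribute one (confidence, correct) pair.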

Abstract:
Neural radiance fields are an emerging 3D scene representation and have recently even been extended to learn features for scene understanding by distilling open‑vocabulary features from vision‑language models. However, current methods primarily focus on object‑centric representations, supporting object segmentation or detection, while understanding semantic relationships between objects remains largely unexplored. To address this gap, we propose RelationField, the first method to extract inter‑object relationships directly from neural radiance fields. RelationField represents relationships between objects as pairs of rays within a neural radiance field, effectively extending its formulation to include implicit relationship queries. To teach RelationField complex, open‑vocabulary relationships, relationship knowledge is distilled from multi‑modal LLMs. To evaluate RelationField, we solve open‑vocabulary 3D scene graph generation tasks and relationship‑guided instance segmentation, achieving state‑of‑the‑art performance in both tasks. See the project website at https://relationfield.github.io.

Abstract:
Discussions of minimum parking requirement policies often include maps of parking lots, which are time consuming to construct manually. Open source datasets for such parking lots are scarce, particularly for US cities. This paper introduces the idea of using Near‑Infrared (NIR) channels as input and several post‑processing techniques to improve the prediction of off‑street surface parking lots using satellite imagery. We constructed two datasets with 12,617 image‑mask pairs each: one with 3‑channel (RGB) and another with 4‑channel (RGB + NIR) imagery. The datasets were used to train five deep learning models (OneFormer, Mask2Former, SegFormer, DeepLabV3, and FCN) for semantic segmentation, classifying images to differentiate between parking and non‑parking pixels. Our results demonstrate that the NIR channel improved accuracy because parking lots are often surrounded by grass, even though the NIR channel needed to be upsampled from a lower resolution. Post‑processing, including eliminating erroneous holes, simplifying edges, and removing road and building footprints, further improved the accuracy. The best model, OneFormer trained on 4‑channel input and paired with post‑processing techniques, achieves a mean Intersection over Union (mIoU) of 84.9 percent and a pixel‑wise accuracy of 96.3 percent.
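
The two reported metrics, mIoU and pixel-wise accuracy, can be computed from a confusion matrix. The following is a minimal sketch for small label masks given as nested lists (0 = non-parking, 1 = parking); the function names and the toy masks are illustrative and not taken from the paper.

```python
def confusion_matrix(pred, truth, num_classes):
    """Count pixels: rows are ground-truth classes, columns are predictions."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            cm[t][p] += 1
    return cm


def mean_iou(cm):
    """Average of per-class TP / (TP + FP + FN), skipping absent classes."""
    ious = []
    for k in range(len(cm)):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp
        fp = sum(row[k] for row in cm) - tp
        denom = tp + fp + fn
        if denom:
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


def pixel_accuracy(cm):
    """Fraction of pixels whose predicted class matches the ground truth."""
    total = sum(sum(row) for row in cm)
    return sum(cm[k][k] for k in range(len(cm))) / total


truth = [[1, 1], [0, 0]]
pred = [[1, 0], [0, 0]]
cm = confusion_matrix(pred, truth, num_classes=2)
```

Here one parking pixel is missed, so the parking IoU is 1/2, the non-parking IoU is 2/3, and three of four pixels are correct.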

Abstract:
Perception is a key building block of autonomously acting vision systems such as autonomous vehicles. It is crucial that these systems are able to understand their surroundings in order to operate safely and robustly. Additionally, autonomous systems deployed in unconstrained real‑world scenarios must be capable of dealing with novel situations and objects that have never been seen before. In this article, we tackle the problem of open‑world panoptic segmentation, i.e., the task of discovering new semantic categories and new object instances at test time, while enforcing consistency among the categories that we incrementally discover. We propose Con2MAV, an approach for open‑world panoptic segmentation that extends our previous work, ContMAV, which was developed for open‑world semantic segmentation. Through extensive experiments across multiple datasets, we show that our model achieves state‑of‑the‑art results on open‑world segmentation tasks, while still performing competitively on the known categories. We will open‑source our implementation upon acceptance. Additionally, we propose PANIC (Panoptic ANomalies In Context), a benchmark for evaluating open‑world panoptic segmentation in autonomous driving scenarios. This dataset, recorded with a multi‑modal sensor suite mounted on a car, provides high‑quality, pixel‑wise annotations of anomalous objects at both semantic and instance level. Our dataset contains 800 images, with more than 50 unknown classes, i.e., classes that do not appear in the training set, and 4000 object instances, making it an extremely challenging dataset for open‑world segmentation tasks in the autonomous driving scenario. We provide competitions for multiple open‑world tasks on a hidden test set. Our dataset and competitions are available at https://www.ipb.uni‑bonn.de/data/panic.

Abstract:
Semantic segmentation and 3D reconstruction are two fundamental tasks in remote sensing, typically treated as separate or loosely coupled tasks. Despite attempts to integrate them into a unified network, the constraints between the two heterogeneous tasks are not explicitly modeled, since the pioneering studies either utilize a loosely coupled parallel structure or engage in only implicit interactions, failing to capture the inherent connections. In this work, we explore the connections between the two tasks and propose a new network that imposes semantic constraints on the stereo matching task, both implicitly and explicitly. Implicitly, we transform the traditional parallel structure to a new cascade structure termed Semantic‑Guided Cascade structure, where the deep features enriched with semantic information are utilized for the computation of initial disparity maps, enhancing semantic guidance. Explicitly, we propose a Semantic Selective Refinement (SSR) module and a Left‑Right Semantic Consistency (LRSC) module. The SSR refines the initial disparity map under the guidance of the semantic map. The LRSC ensures semantic consistency between two views via reducing the semantic divergence after transforming the semantic map from one view to the other using the disparity map. Experiments on the US3D and WHU datasets demonstrate that our method achieves state‑of‑the‑art performance for both semantic segmentation and stereo matching.

Abstract:
In recent years, semantic segmentation has flourished in various applications. However, the high computational cost remains a significant challenge that hinders its further adoption. The filter pruning method for structured network slimming offers a direct and effective solution for the reduction of segmentation networks. Nevertheless, we argue that most existing pruning methods, originally designed for image classification, overlook the fact that segmentation is a location‑sensitive task, which consequently leads to their suboptimal performance when applied to segmentation networks. To address this issue, this paper proposes a novel approach, denoted as Spatial‑aware Information Redundancy Filter Pruning (SIRFP), which aims to reduce feature redundancy between channels. First, we formulate the pruning process as a maximum edge weight clique problem (MEWCP) in graph theory, thereby minimizing the redundancy among the remaining features after pruning. Within this framework, we introduce a spatial‑aware redundancy metric based on feature maps, thus endowing the pruning process with location sensitivity to better adapt to pruning segmentation networks. Additionally, based on the MEWCP, we propose a low computational complexity greedy strategy to solve this NP‑hard problem, making it feasible and efficient for structured pruning. To validate the effectiveness of our method, we conducted extensive comparative experiments on various challenging datasets. The results demonstrate the superior performance of SIRFP for semantic segmentation tasks.

Abstract:
Recently, the development of unified medical image segmentation models has gained increasing attention, especially with the advent of the Segment Anything Model (SAM). SAM has shown promising binary segmentation performance in natural domains; however, transferring it to the medical domain remains challenging, as medical images often possess substantial inter‑category overlaps. To address this, we propose the SEmantic‑Guided SAM (SEG‑SAM), a unified medical segmentation model that incorporates semantic medical knowledge to enhance medical segmentation performance. First, to avoid the potential conflict between binary and semantic predictions, we introduce a semantic‑aware decoder independent of SAM's original decoder, specialized for both semantic segmentation on the prompted object and classification on unprompted objects in images. To further enhance the model's semantic understanding, we solicit key characteristics of medical categories from large language models and incorporate them into SEG‑SAM through a text‑to‑vision semantic module, adaptively transferring the language information into the visual segmentation task. Finally, we introduce a cross‑mask spatial alignment strategy to encourage greater overlap between the predicted masks from SEG‑SAM's two decoders, thereby benefiting both predictions. Extensive experiments demonstrate that SEG‑SAM outperforms state‑of‑the‑art SAM‑based methods in unified binary medical segmentation and task‑specific methods in semantic medical segmentation, showcasing promising results and potential for broader medical applications.

Abstract:
Few‑shot segmentation is the problem of learning to identify specific types of objects (e.g., airplanes) in images from a small set of labeled reference images. The current state of the art is driven by resource‑intensive construction of models for every new domain‑specific application. Such models must be trained on enormous labeled datasets of unrelated objects (e.g., cars, trains, animals) so that their "knowledge" can be transferred to new types of objects. In this paper, we show how to leverage existing vision foundation models (VFMs) to reduce the incremental cost of creating few‑shot segmentation models for new domains. Specifically, we introduce SAMIC, a small network that learns how to prompt VFMs in order to segment new types of objects in domain‑specific applications. SAMIC enables any task to be approached as a few‑shot learning problem. At 2.6 million parameters, it is 94% smaller than the leading models (e.g., having ResNet 101 backbone with 45+ million parameters). Even using 1/5th of the training data provided by one‑shot benchmarks, SAMIC is competitive with, or sets the state of the art, on a variety of few‑shot and semantic segmentation datasets including COCO‑20^i, Pascal‑5^i, PerSeg, FSS‑1000, and NWPU VHR‑10.

Abstract:
Previous studies showed that image datasets lacking geographic diversity can lead to biased performance in models trained on them. While earlier work studied general‑purpose image datasets (e.g., ImageNet) and simple tasks like image recognition, we investigated geo‑biases in real‑world driving datasets on a more complex task: instance segmentation. We examined if instance segmentation models trained on European driving scenes (Eurocentric models) are geo‑biased. Consistent with previous work, we found that Eurocentric models were geo‑biased. Interestingly, we found that geo‑biases came from classification errors rather than localization errors, with classification errors alone contributing 10‑90% of the geo‑biases in segmentation and 19‑88% of the geo‑biases in detection. This showed that while classification is geo‑biased, localization (including detection and segmentation) is geographically robust. Our findings show that in region‑specific models (e.g., Eurocentric models), geo‑biases from classification errors can be significantly mitigated by using coarser classes (e.g., grouping car, bus, and truck as 4‑wheeler).

Abstract:
We propose SAM‑IF, a novel method for incremental few‑shot instance segmentation leveraging the Segment Anything Model (SAM). SAM‑IF addresses the challenges of class‑agnostic instance segmentation by introducing a multi‑class classifier and fine‑tuning SAM to focus on specific target objects. To enhance few‑shot learning capabilities, SAM‑IF employs a cosine‑similarity‑based classifier, enabling efficient adaptation to novel classes with minimal data. Additionally, SAM‑IF supports incremental learning by updating classifier weights without retraining the decoder. Our method achieves competitive but more reasonable results compared to existing approaches, particularly in scenarios requiring specific object segmentation with limited labeled data.
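
The idea of a cosine-similarity classifier that accepts new classes without retraining can be sketched as below. It assumes each class is represented by a prototype vector (here, the mean of its few-shot embeddings) and that adding a class only stores a new prototype; the class and method names are hypothetical and this is not the SAM‑IF implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class CosineClassifier:
    """Prototype-based classifier; new classes are added incrementally."""

    def __init__(self):
        self.prototypes = {}

    def add_class(self, name, embeddings):
        # Assumption: the prototype is the mean of the support embeddings.
        dim = len(embeddings[0])
        count = len(embeddings)
        self.prototypes[name] = [
            sum(e[i] for e in embeddings) / count for i in range(dim)
        ]

    def predict(self, embedding):
        return max(
            self.prototypes,
            key=lambda name: cosine(embedding, self.prototypes[name]),
        )


clf = CosineClassifier()
clf.add_class("cat", [[1.0, 0.0], [0.9, 0.1]])
# A later incremental step adds a novel class from a single example.
clf.add_class("dog", [[0.0, 1.0]])
```

Because existing prototypes are untouched when a class is added, earlier classes are not overwritten by the update.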

Abstract:
Vision transformers (ViTs) have dominated computer vision in recent years. However, ViTs are computationally expensive and not well suited for mobile devices; this led to the prevalence of convolutional neural network (CNN) and ViT‑based hybrid models for mobile vision applications. Recently, Vision GNN (ViG) and CNN hybrid models have also been proposed for mobile vision tasks. However, all of these methods remain slower compared to pure CNN‑based models. In this work, we propose Multi‑Level Dilated Convolutions to devise a purely CNN‑based mobile backbone. Using Multi‑Level Dilated Convolutions allows for a larger theoretical receptive field than standard convolutions. Different levels of dilation also allow for interactions between the short‑range and long‑range features in an image. Experiments show that our proposed model outperforms state‑of‑the‑art (SOTA) mobile CNN, ViT, ViG, and hybrid architectures in terms of accuracy and/or speed on image classification, object detection, instance segmentation, and semantic segmentation. Our fastest model, RapidNet‑Ti, achieves 76.3% top‑1 accuracy on ImageNet‑1K with 0.9 ms inference latency on an iPhone 13 mini NPU, which is faster and more accurate than MobileNetV2x1.4 (74.7% top‑1 with 1.0 ms latency). Our work shows that pure CNN architectures can beat SOTA hybrid and ViT models in terms of accuracy and speed when designed properly.
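
The claim about a larger theoretical receptive field follows from standard convolution arithmetic: a kernel of size k with dilation d spans d(k-1)+1 pixels, and stacking stride-1 layers adds d(k-1) per layer. The sketch below applies this textbook formula to stacked layers; it is not a model of the RapidNet block itself, whose exact layout the abstract does not describe.

```python
def receptive_field(layers):
    """Theoretical receptive field of stacked stride-1 convolutions.

    `layers` is a list of (kernel_size, dilation) pairs.
    """
    rf = 1
    for kernel_size, dilation in layers:
        rf += dilation * (kernel_size - 1)
    return rf


standard = receptive_field([(3, 1), (3, 1), (3, 1)])  # three plain 3x3 convs
dilated = receptive_field([(3, 1), (3, 2), (3, 3)])   # dilations 1, 2, 3
```

With the same number of 3x3 layers and the same parameter count, the dilated stack covers 13 pixels per side instead of 7.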

Abstract:
Open‑set 3D segmentation represents a major point of interest for multiple downstream robotics and augmented/virtual reality applications. We present a decoupled 3D segmentation pipeline to ensure modularity and adaptability to novel 3D representations as well as semantic segmentation foundation models. We first reconstruct a scene with 3D Gaussians and learn class‑agnostic features through contrastive supervision from a 2D instance proposal network. These 3D features are then clustered to form coarse object‑ or part‑level masks. Finally, we match each 3D cluster to class‑aware masks predicted by a 2D open‑vocabulary segmentation model, assigning semantic labels without retraining the 3D representation. Our decoupled design (1) provides a plug‑and‑play interface for swapping different 2D or 3D modules, (2) ensures multi‑object instance segmentation at no extra cost, and (3) leverages rich 3D geometry for robust scene understanding. We evaluate on synthetic and real‑world indoor datasets, demonstrating improved performance over comparable NeRF‑based pipelines on mIoU and mAcc, particularly for challenging or long‑tail classes. We also show how varying the 2D backbone affects the final segmentation, highlighting the modularity of our framework. These results confirm that decoupling 3D mask proposal and semantic classification can deliver flexible, efficient, and open‑vocabulary 3D segmentation.

Abstract:
2D images and 3D point clouds are foundational data types for multimedia applications, including real‑time video analysis, augmented reality (AR), and 3D scene understanding. Class‑incremental semantic segmentation (CSS) requires incrementally learning new semantic categories while retaining prior knowledge. Existing methods typically rely on computationally expensive training based on stochastic gradient descent, employing complex regularization or exemplar replay. However, stochastic gradient descent‑based approaches inevitably update the model's weights for past knowledge, leading to catastrophic forgetting, a problem exacerbated by pixel/point‑level granularity. To address these challenges, we propose CFSSeg, a novel exemplar‑free approach that leverages a closed‑form solution, offering a practical and theoretically grounded solution for continual semantic segmentation tasks. This eliminates the need for iterative gradient‑based optimization and storage of past data, requiring only a single pass through new samples per step. It not only enhances computational efficiency but also provides a practical solution for dynamic, privacy‑sensitive multimedia environments. Extensive experiments on 2D and 3D benchmark datasets such as Pascal VOC2012, S3DIS, and ScanNet demonstrate CFSSeg's superior performance.
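
To illustrate why a closed-form solution can avoid forgetting, here is a minimal sketch using ridge regression on fixed features. It assumes a frozen feature extractor, one-hot targets, and accumulated sufficient statistics (the Gram matrix and the feature-label cross term); these choices and all names are illustrative assumptions, not the CFSSeg formulation.

```python
def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


class ClosedFormHead:
    """Ridge-regression classifier updated in one pass per step."""

    def __init__(self, dim, reg=1e-3):
        self.dim = dim
        # Gram matrix starts as reg * I, which also keeps it invertible.
        self.gram = [[reg if i == j else 0.0 for j in range(dim)]
                     for i in range(dim)]
        self.cross = [[] for _ in range(dim)]  # dim x num_classes
        self.weights = [[] for _ in range(dim)]

    def fit_step(self, feats, labels, num_new_classes):
        # Grow the cross term with zero columns for the new classes.
        for row in self.cross:
            row.extend([0.0] * num_new_classes)
        for x, y in zip(feats, labels):
            for i in range(self.dim):
                self.cross[i][y] += x[i]
                for j in range(self.dim):
                    self.gram[i][j] += x[i] * x[j]
        # Closed-form weights: (X^T X + reg I)^-1 X^T Y, no gradient steps.
        self.weights = solve(self.gram, self.cross)

    def predict(self, x):
        n_cls = len(self.weights[0])
        scores = [sum(x[i] * self.weights[i][c] for i in range(self.dim))
                  for c in range(n_cls)]
        return scores.index(max(scores))


# Two incremental steps, one new class each, with no stored exemplars.
head = ClosedFormHead(dim=2)
head.fit_step([[1.0, 0.0], [0.9, 0.1]], [0, 0], num_new_classes=1)
head.fit_step([[0.0, 1.0], [0.1, 0.9]], [1, 1], num_new_classes=1)

# Joint training on all data at once, for comparison.
joint = ClosedFormHead(dim=2)
joint.fit_step([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]],
               [0, 0, 1, 1], num_new_classes=2)
```

In this sketch the incremental weights equal the jointly trained ones, since only the accumulated statistics are kept between steps and no past samples are stored.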

Abstract:
Deep neural networks (DNNs) are a contemporary solution for semantic segmentation and are usually trained to operate on a predefined closed set of classes. In open‑set environments, it is possible to encounter semantically unknown objects or anomalies. Road driving is an example of such an environment in which, from a safety standpoint, it is important to ensure that a DNN indicates it is operating outside of its learned semantic domain. One possible approach to anomaly segmentation is entropy maximization, which is paired with a logistic regression based post‑processing step called meta classification, which is in turn used to improve the reliability of detection of anomalous pixels. We propose to substitute the logistic regression meta classifier with a more expressive lightweight fully connected neural network. We analyze advantages and drawbacks of the proposed neural network meta classifier and demonstrate its better performance over logistic regression. We also introduce the concept of informative out‑of‑distribution examples which we show to improve training results when using entropy maximization in practice. Finally, we discuss the loss of interpretability and show that the behavior of logistic regression and neural network is strongly correlated.
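
Entropy-based anomaly scoring, which underlies the entropy maximization approach mentioned above, can be sketched per pixel as follows. The sketch assumes the segmentation network outputs class logits and uses the softmax entropy, normalized by its maximum, as the anomaly score; the function names are illustrative, and the meta classifier itself is not modeled.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(logits):
    """Softmax entropy divided by log(num_classes), so it lies in [0, 1]."""
    probs = softmax(logits)
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(probs))


confident_pixel = normalized_entropy([10.0, 0.0, 0.0])  # near 0: known class
uncertain_pixel = normalized_entropy([1.0, 1.0, 1.0])   # 1: maximally unsure
```

Training with entropy maximization pushes out-of-distribution pixels toward the high-entropy end, so thresholding this score yields a candidate anomaly mask.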

Abstract:
The rapid advancement of deep learning has intensified the need for comprehensive data for use by autonomous driving algorithms. High‑quality datasets are crucial for the development of effective data‑driven autonomous driving solutions. Next‑generation autonomous driving datasets must be multimodal, incorporating data from advanced sensors that feature extensive data coverage, detailed annotations, and diverse scene representation. To address this need, we present OmniHD‑Scenes, a large‑scale multimodal dataset that provides comprehensive omnidirectional high‑definition data. The OmniHD‑Scenes dataset combines data from 128‑beam LiDAR, six cameras, and six 4D imaging radar systems to achieve full environmental perception. The dataset comprises 1501 clips, each approximately 30‑s long, totaling more than 450K synchronized frames and more than 5.85 million synchronized sensor data points. We also propose a novel 4D annotation pipeline. To date, we have annotated 200 clips with more than 514K precise 3D bounding boxes. These clips also include semantic segmentation annotations for static scene elements. Additionally, we introduce a novel automated pipeline for generation of the dense occupancy ground truth, which effectively leverages information from non‑key frames. Alongside the proposed dataset, we establish comprehensive evaluation metrics, baseline models, and benchmarks for 3D detection and semantic occupancy prediction. These benchmarks utilize surround‑view cameras and 4D imaging radar to explore cost‑effective sensor solutions for autonomous driving applications. Extensive experiments demonstrate the effectiveness of our low‑cost sensor configuration and its robustness under adverse conditions. Data will be released at https://www.2077ai.com/OmniHD‑Scenes.

Abstract:
3D Gaussian Splatting has recently gained traction for its efficient training and real‑time rendering. While its vanilla representation is mainly designed for view synthesis, recent works extended it to scene understanding with language features. However, storing additional high‑dimensional features per Gaussian for semantic information is memory‑intensive, which limits their ability to segment and interpret challenging scenes. To this end, we introduce SuperGSeg, a novel approach that fosters cohesive, context‑aware hierarchical scene representation by disentangling segmentation and language field distillation. SuperGSeg first employs neural 3D Gaussians to learn geometry, instance and hierarchical segmentation features from multi‑view images with the aid of off‑the‑shelf 2D masks. These features are then leveraged to create a sparse set of Super‑Gaussians. Super‑Gaussians facilitate the lifting and distillation of 2D language features into 3D space. They enable hierarchical scene understanding with high‑dimensional language feature rendering at moderate GPU memory costs. Extensive experiments demonstrate that SuperGSeg achieves remarkable performance on both open‑vocabulary object selection and semantic segmentation tasks.

Abstract:
The detection and classification of exfoliated two‑dimensional (2D) material flakes from optical microscope images can be automated using computer vision algorithms. This has the potential to increase the accuracy and objectivity of classification and the efficiency of sample fabrication, and it allows for large‑scale data collection. Existing algorithms often struggle to identify low‑contrast materials and typically require large amounts of training data. Here, we present a deep learning model, called MaskTerial, that uses an instance segmentation network to reliably identify 2D material flakes. The model is extensively pre‑trained using a synthetic data generator that generates realistic microscopy images from unlabeled data. This results in a model that can quickly adapt to new materials with as few as 5 to 10 images. Furthermore, an uncertainty estimation model is used to finally classify the predictions based on optical contrast. We evaluate our method on eight different datasets comprising five different 2D materials and demonstrate significant improvements over existing techniques in the detection of low‑contrast materials such as hexagonal boron nitride.

Abstract:
Semantic segmentation in videos has been a focal point of recent research. However, existing models encounter challenges when faced with unfamiliar categories. To address this, we introduce the Open Vocabulary Video Semantic Segmentation (OV‑VSS) task, designed to accurately segment every pixel across a wide range of open‑vocabulary categories, including those that are novel or previously unexplored. To enhance OV‑VSS performance, we propose a robust baseline, OV2VSS, which integrates a spatial‑temporal fusion module, allowing the model to utilize temporal relationships across consecutive frames. Additionally, we incorporate a random frame enhancement module, broadening the model's understanding of semantic context throughout the entire video sequence. Our approach also includes video text encoding, which strengthens the model's capability to interpret textual information within the video context. Comprehensive evaluations on benchmark datasets such as VSPW and Cityscapes highlight OV2VSS's zero‑shot generalization capabilities, especially in handling novel categories. The results validate OV2VSS's effectiveness, demonstrating improved performance in semantic segmentation tasks across diverse video datasets.

Abstract:
Existing few‑shot medical image segmentation (FSMIS) models fail to address a practical issue in medical imaging: the domain shift caused by different imaging techniques, which limits the applicability to current FSMIS tasks. To overcome this limitation, we focus on the cross‑domain few‑shot medical image segmentation (CD‑FSMIS) task, aiming to develop a generalized model capable of adapting to a broader range of medical image segmentation scenarios with limited labeled data from the novel target domain. Inspired by the characteristics of frequency domain similarity across different domains, we propose a Frequency‑aware Matching Network (FAMNet), which includes two key components: a Frequency‑aware Matching (FAM) module and a Multi‑Spectral Fusion (MSF) module. The FAM module tackles two problems during the meta‑learning phase: 1) intra‑domain variance caused by the inherent support‑query bias, due to the different appearances of organs and lesions, and 2) inter‑domain variance caused by different medical imaging techniques. Additionally, we design an MSF module to integrate the different frequency features decoupled by the FAM module, and further mitigate the impact of inter‑domain variance on the model's segmentation performance. Combining these two modules, our FAMNet surpasses existing FSMIS models and Cross‑domain Few‑shot Semantic Segmentation models on three cross‑domain datasets, achieving state‑of‑the‑art performance in the CD‑FSMIS task.

Abstract:
Segmentation models are typically constrained by the categories defined during training. To address this, researchers have explored two independent approaches: adapting Vision‑Language Models (VLMs) and leveraging synthetic data. However, VLMs often struggle with granularity, failing to disentangle fine‑grained concepts, while synthetic data‑based methods remain limited by the scope of available datasets. This paper proposes enhancing segmentation accuracy across diverse domains by integrating Vision‑Language reasoning with key strategies for Unsupervised Domain Adaptation (UDA). First, we improve the fine‑grained segmentation capabilities of VLMs through multi‑scale contextual data, robust text embeddings with prompt augmentation, and layer‑wise fine‑tuning in our proposed Foundational‑Retaining Open Vocabulary Semantic Segmentation (FROVSS) framework. Next, we incorporate these enhancements into a UDA framework by employing distillation to stabilize training and cross‑domain mixed sampling to boost adaptability without compromising generalization. The resulting UDA‑FROVSS framework is the first UDA approach to effectively adapt across domains without requiring shared categories.

Abstract:
Channel and spatial attention mechanisms introduced by earlier works enhance the representation abilities of deep convolutional neural networks (CNNs) but often lead to increased parameter and computation costs. While recent approaches focus solely on efficient feature context modeling for channel attention, we aim to model both channel and spatial attention comprehensively with minimal parameters and reduced computation. Leveraging the principles of relational modeling in graphs, we introduce a constant‑parameter module, STEAM: Squeeze and Transform Enhanced Attention Module, which integrates channel and spatial attention to enhance the representation power of CNNs. To our knowledge, we are the first to propose a graph‑based approach for modeling both channel and spatial attention, utilizing concepts from multi‑head graph transformers. Additionally, we introduce Output Guided Pooling (OGP), which efficiently captures spatial context to further enhance spatial attention. We extensively evaluate STEAM for large‑scale image classification, object detection and instance segmentation on standard benchmark datasets. STEAM achieves a 2% increase in accuracy over the standard ResNet‑50 model with only a meager increase in GFLOPs. Furthermore, STEAM outperforms leading modules ECA and GCT in terms of accuracy while achieving a three‑fold reduction in GFLOPs.

Abstract:
Semantic segmentation is a fundamental task in multimedia processing, with applications that include analyzing, understanding, and editing the contents of images and videos. To accelerate the analysis of multimedia data, existing segmentation research tends to extract semantic information by progressively reducing the spatial resolution of feature maps. However, this approach introduces a misalignment problem when restoring the resolution of high‑level feature maps. In this paper, we design a Semantic Refinement Module (SRM) to address this issue within the segmentation network. Specifically, SRM is designed to learn a transformation offset for each pixel in the upsampled feature maps, guided by high‑resolution feature maps and neighboring offsets. By applying these offsets to the upsampled feature maps, SRM enhances the semantic representation of the segmentation network, particularly for pixels around object boundaries. Furthermore, a Contextual Refinement Module (CRM) is presented to capture global context information across both spatial and channel dimensions. To balance dimensions between channel and space, we aggregate the semantic maps from all four stages of the backbone to enrich channel context information. The efficacy of these proposed modules is validated on three widely used datasets (Cityscapes, Bdd100K, and ADE20K), demonstrating superior performance compared to state‑of‑the‑art methods. Additionally, this paper extends these modules to a lightweight segmentation network, achieving an mIoU of 82.5% on the Cityscapes validation set with only 137.9 GFLOPs.

Abstract:
Low computational complexity and high segmentation accuracy are both essential to real‑world semantic segmentation tasks. However, to speed up model inference, most existing approaches tend to design light‑weight networks with a very limited number of parameters, leading to a considerable degradation in accuracy due to the reduced representation ability of the networks. To solve this problem, this paper proposes a novel semantic segmentation method that improves the capacity of light‑weight networks to obtain semantic information. Specifically, a feature refinement module (FRM) is proposed to extract semantics from multi‑stage feature maps generated by the backbone and capture non‑local contextual information by utilizing a transformer block. On the Cityscapes and Bdd100K datasets, the experimental results demonstrate that the proposed method achieves a promising trade‑off between accuracy and computational cost, especially on the Cityscapes test set, where 80.4% mIoU is achieved with only 214.82 GFLOPs.

Abstract:
Temporal forward‑tracking has been the dominant approach for multi‑object segmentation and tracking (MOTS). However, a novel time‑symmetric tracking methodology has recently been introduced for the detection, segmentation, and tracking of budding yeast cells in pre‑recorded samples. Although this architecture has demonstrated a unique perspective on stable and consistent tracking, as well as missed instance re‑interpolation, its evaluation has so far been largely confined to settings related to videomicroscopic environments. In this work, we aim to reveal the broader capabilities, advantages, and potential challenges of this architecture across various specifically designed scenarios, including a pedestrian tracking dataset. We also conduct an ablation study comparing the model against its restricted variants and the widely used Kalman filter. Furthermore, we present an attention analysis of the tracking architecture for both pretrained and non‑pretrained models.

Abstract:
Audio‑visual video segmentation (AVVS) aims to generate pixel‑level maps of sound‑producing objects that accurately align with the corresponding audio. However, existing methods often face temporal misalignment, where audio cues and segmentation results are not temporally coordinated. Audio provides two critical pieces of information: i) target object‑level details and ii) the timing of when objects start and stop producing sounds. Current methods focus more on object‑level information but neglect the boundaries of audio semantic changes, leading to temporal misalignment. To address this issue, we propose a Collaborative Hybrid Propagator Framework (Co‑Prop). This framework includes two main steps: Preliminary Audio Boundary Anchoring and Frame‑by‑Frame Audio‑Insert Propagation. To anchor the audio boundary, we employ retrieval‑assist prompts with Qwen large language models to identify control points of audio semantic changes. These control points split the audio into semantically consistent audio portions. After obtaining the control point lists, we propose the Audio Insertion Propagator to process each audio portion using a frame‑by‑frame audio insertion propagation and matching approach. We curated a compact dataset comprising diverse source conversion cases and devised a metric to assess alignment rates. Compared to traditional simultaneous processing methods, our approach reduces memory requirements and facilitates frame alignment. Experimental results demonstrate the effectiveness of our approach across three datasets and two backbones. Furthermore, our method can be integrated with existing AVVS approaches, offering plug‑and‑play functionality to enhance their performance.

Abstract:
Most existing mobile robotic datasets primarily capture static scenes, limiting their utility for evaluating robotic performance in dynamic environments. To address this, we present a mobile robot oriented large‑scale indoor dataset, denoted as THUD++ (TsingHua University Dynamic) robotic dataset, for dynamic scene understanding. Our current dataset includes 13 large‑scale dynamic scenarios, combining both real‑world and synthetic data collected with a real robot platform and a physical simulation platform, respectively. The RGB‑D dataset comprises over 90K image frames, 20M 2D/3D bounding boxes of static and dynamic objects, camera poses, and IMU. The trajectory dataset covers over 6,000 pedestrian trajectories in indoor scenes. Additionally, the dataset is augmented with a Unity3D‑based simulation platform, allowing researchers to create custom scenes and test algorithms in a controlled environment. We evaluate state‑of‑the‑art methods on THUD++ across mainstream indoor scene understanding tasks, e.g., 3D object detection, semantic segmentation, relocalization, pedestrian trajectory prediction, and navigation. Our experiments highlight the challenges mobile robots encounter in indoor environments, especially when navigating in complex, crowded, and dynamic scenes. By sharing this dataset, we aim to accelerate the development and testing of mobile robot algorithms, contributing to real‑world robotic applications.

Abstract:
Video semantic segmentation (VSS) has been widely employed in many fields, such as simultaneous localization and mapping, autonomous driving, and surveillance. Its core challenge is how to leverage temporal information to achieve better segmentation. Previous efforts have primarily focused on pixel‑level static‑dynamic context matching, utilizing techniques such as optical flow and attention mechanisms. Instead, this paper rethinks static‑dynamic contexts at the class level and proposes a novel static‑dynamic class‑level perceptual consistency (SD‑CPC) framework. In this framework, we propose a multivariate class prototype with contrastive learning and a static‑dynamic semantic alignment module. The former provides class‑level constraints for the model, obtaining personalized inter‑class features and diversified intra‑class features. The latter first establishes intra‑frame spatial multi‑scale and multi‑level correlations to achieve static semantic alignment. Then, based on cross‑frame static perceptual differences, it performs two‑stage cross‑frame selective aggregation to achieve dynamic semantic alignment. Meanwhile, we propose a window‑based attention map calculation method that leverages the sparsity of attention points during cross‑frame aggregation to reduce computation cost. Extensive experiments on the VSPW and Cityscapes datasets show that the proposed approach outperforms state‑of‑the‑art methods. Our implementation will be open‑sourced on GitHub.

Abstract:
This paper proposes a novel method for omnidirectional 360° perception. Most common previous methods relied on equirectangular projection. This representation is easily applicable to 2D operation layers but introduces distortions into the image. Other methods attempted to remove the distortions by maintaining a sphere representation but relied on complicated convolution kernels that failed to show competitive results. In this work, we introduce a transformer‑based architecture that, by incorporating a novel "Spherical Local Self‑Attention" and other spherically‑oriented modules, successfully operates in the spherical domain and outperforms the state‑of‑the‑art in 360° perception benchmarks for depth estimation and semantic segmentation.

Abstract:
Deep Learning has become a ubiquitous paradigm due to its extraordinary effectiveness and applicability in numerous domains. However, the approach requires large amounts of data to reach the full potential of this type of model. An ever‑growing sub‑field of Artificial Intelligence, Image Synthesis, aims to address this limitation through the design of intelligent models capable of creating original and realistic images, an endeavour which could drastically reduce the need for real data. The Stable Diffusion generation paradigm recently propelled state‑of‑the‑art approaches to exceed all previous benchmarks. In this work, we propose the ContRail framework based on the novel Stable Diffusion model ControlNet, which we empower through a multi‑modal conditioning method. We experiment with the task of synthetic railway image generation, where we improve the performance in rail‑specific tasks, such as rail semantic segmentation, by enriching the dataset with realistic synthetic images.

Abstract:
Multi‑class semantic segmentation remains a cornerstone challenge in computer vision. Yet, dataset creation remains excessively demanding in time and effort, especially for specialized domains. Active Learning (AL) mitigates this challenge by strategically selecting data points for annotation. However, existing patch‑based AL methods often overlook the critical information carried by boundary pixels, which is essential for accurate segmentation. We present OREAL, a novel patch‑based AL method designed for multi‑class semantic segmentation. OREAL enhances boundary detection by employing maximum aggregation of pixel‑wise uncertainty scores. Additionally, we introduce one‑vs‑rest entropy, a novel uncertainty score function that computes class‑wise uncertainties while achieving implicit class balancing during dataset creation. Comprehensive experiments across diverse datasets and model architectures validate our hypothesis.
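
The two ingredients named in this abstract, a class‑wise uncertainty score and maximum aggregation over a patch, can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the abstract does not define one‑vs‑rest entropy, so the code assumes it is the binary entropy of each class probability against the rest, and the function names, the per‑pixel reduction over classes, and the patch size are all hypothetical.

```python
import numpy as np

def one_vs_rest_entropy(probs, eps=1e-12):
    """probs: (C, H, W) softmax probabilities.
    Returns (C, H, W): binary entropy of p_c versus 1 - p_c (assumed definition)."""
    p = np.clip(probs, eps, 1.0 - eps)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

def patch_scores_max(pixel_scores, patch):
    """pixel_scores: (H, W). Returns the maximum score inside each
    non-overlapping patch, with shape (H // patch, W // patch)."""
    h, w = pixel_scores.shape
    hp, wp = h // patch, w // patch
    cropped = pixel_scores[:hp * patch, :wp * patch]
    return cropped.reshape(hp, patch, wp, patch).max(axis=(1, 3))

# Toy 2-class, 4x4 prediction: confident everywhere except one pixel.
probs = np.empty((2, 4, 4))
probs[0], probs[1] = 0.99, 0.01
probs[:, 3, 3] = 0.5                      # a single ambiguous (boundary-like) pixel

ovr = one_vs_rest_entropy(probs)          # (2, 4, 4) class-wise uncertainty
pixel_unc = ovr.max(axis=0)               # assumed reduction over classes
scores = patch_scores_max(pixel_unc, patch=2)
best_patch = np.unravel_index(scores.argmax(), scores.shape)
```

With maximum aggregation, the single ambiguous pixel is enough to rank its patch first; a mean over the patch would dilute it with the three confident neighbours.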

Abstract:
In the domain of U.S. Army modeling and simulation, the availability of high‑quality annotated 3D data is pivotal to creating virtual environments for training and simulations. Traditional methodologies for 3D semantic and instance segmentation, such as KpConv, RandLA, and Mask3D, are designed to train on extensive labeled datasets to obtain satisfactory performance in practical tasks. This requirement presents a significant challenge, given the inherent scarcity of manually annotated 3D datasets, particularly for military use cases. Recognizing this gap, our previous research leveraged the manually annotated databases of the One World Terrain data repository, as showcased at IITSEC 2019 and 2021, to enrich the training dataset for deep learning models. However, collecting and annotating large‑scale 3D data for specific tasks remains costly and inefficient. To this end, the objective of this research is to design and develop a comprehensive and efficient framework for 3D segmentation tasks to assist in 3D data annotation. This framework integrates Grounding DINO and the Segment Anything Model, augmented by an enhancement in 2D image rendering via 3D mesh. Furthermore, the authors have also developed a user‑friendly interface that facilitates the 3D annotation process, offering intuitive visualization of rendered images and the 3D point cloud.

Abstract:
We focus on tertiary lymphoid structure (TLS) semantic segmentation in whole slide images (WSIs). Unlike TLS binary segmentation, TLS semantic segmentation identifies boundaries and maturity, which requires integrating contextual information to discover discriminative features. Due to the extensive scale of WSIs (e.g., 100,000 × 100,000 pixels), the segmentation of TLS is usually carried out through a patch‑based strategy. However, this prevents the model from accessing information outside of the patches, limiting the performance. To address this issue, we propose GCUNet, a GNN‑based contextual learning network for TLS semantic segmentation. Given an image patch (target) to be segmented, GCUNet first progressively aggregates long‑range and fine‑grained context outside the target. Then, a Detail and Context Fusion block (DCFusion) is designed to integrate the context and detail of the target to predict the segmentation mask. We build four TLS semantic segmentation datasets, called TCGA‑COAD, TCGA‑LUSC, TCGA‑BLCA and INHOUSE‑PAAD, and make the former three datasets (comprising 826 WSIs and 15,276 TLSs) publicly available to promote research on TLS semantic segmentation. Experiments on these datasets demonstrate the superiority of GCUNet, achieving at least 7.41% improvement in mF1 compared with SOTA.

Abstract:
In this paper, we propose a novel semantic splatting approach based on Gaussian Splatting to achieve efficiency and low latency. Our method projects the RGB attributes and semantic features of point clouds onto the image plane, simultaneously rendering RGB images and semantic segmentation results. Leveraging the explicit structure of point clouds and a one‑time rendering strategy, our approach significantly enhances efficiency during optimization and rendering. Additionally, we employ SAM2 to generate pseudo‑labels for boundary regions, which often lack sufficient supervision, and introduce two‑level aggregation losses at the 2D feature map and 3D spatial levels to improve view consistency and spatial continuity.

Abstract:
The use of synthetic images in medical imaging Artificial Intelligence (AI) solutions has been shown to be beneficial in addressing the limited availability of diverse, unbiased, and representative data. Despite the extensive use of synthetic image generation methods, controlling semantic variability and context details remains challenging, limiting their effectiveness in producing diverse and representative medical image datasets. In this work, we introduce a scalable semantic and context‑conditioned generative model, coined CSG (Context‑Semantic Guidance). This dual conditioning approach allows for comprehensive control over both structure and appearance, advancing the synthesis of realistic and diverse ultrasound images. We demonstrate the ability of CSG to generate findings (pathological anomalies) in musculoskeletal (MSK) ultrasound images. Moreover, we test the quality of the synthetic images using a three‑fold validation protocol. The results show that the synthetic images generated by CSG improve the performance of semantic segmentation models, exhibit enhanced similarity to real images compared to the baseline methods, and are indistinguishable from real images according to a Turing test. Furthermore, we demonstrate an extension of CSG that enlarges the variability space of images by synthetically generating augmentations of anatomical geometries and textures.

Abstract:
In this study, we developed a customized instance segmentation model by integrating the Convolutional Block Attention Module (CBAM) with the YOLO11 architecture. This model, trained on a mixed dataset of dormant and canopy season apple orchard images, aimed to enhance the segmentation of tree trunks and branches under varying seasonal conditions throughout the year. After training YOLO11‑CBAM on the mixed dataset collected over the two seasons, the model was validated separately on dormant and canopy season images. Additional testing of the model during the pre‑bloom, flower bloom, fruit thinning, and harvest seasons was performed. The highest recall and precision metrics were observed in YOLO11x‑seg‑CBAM and YOLO11m‑seg‑CBAM, respectively. In particular, YOLO11m‑seg with CBAM showed the highest precision of 0.83 for the Trunk class in training, while without CBAM, YOLO11m‑seg achieved a precision of 0.80 for the Trunk class. Likewise, for the Branch class, YOLO11m‑seg with CBAM achieved the highest precision of 0.75, while without CBAM, YOLO11m‑seg achieved a precision of 0.73. For dormant season validation, YOLO11x‑seg exhibited the highest precision at 0.91. Canopy season validation highlighted YOLO11s‑seg with superior precision across all classes, achieving 0.516 for Branch and 0.64 for Trunk. The modeling approach, trained on datasets from two seasons (dormant and canopy), demonstrated the potential of the YOLO11‑CBAM integration to effectively detect and segment tree trunks and branches year‑round across all seasonal variations. Keywords: YOLOv11, YOLOv11 Tree Detection, YOLOv11 Branch Detection and Segmentation, Machine Vision, Deep Learning, Machine Learning

Abstract:
Cloud robotics enables robots to offload complex computational tasks to cloud servers for performance and ease of management. However, cloud compute can be costly, cloud services can suffer occasional downtime, and connectivity between the robot and cloud can be prone to variations in network Quality‑of‑Service (QoS). We present FogROS2‑FT (Fault Tolerant) to mitigate these issues by introducing a multi‑cloud extension that automatically replicates independent stateless robotic services, routes requests to these replicas, and directs the first response back. With replication, robots can still benefit from cloud computations even when a cloud service provider is down or there is low QoS. Additionally, many cloud computing providers offer low‑cost spot computing instances that may shut down unpredictably. Normally, these low‑cost instances would be inappropriate for cloud robotics, but the fault‑tolerant nature of FogROS2‑FT allows them to be used reliably. We demonstrate FogROS2‑FT fault tolerance capabilities in 3 cloud‑robotics scenarios in simulation (visual object detection, semantic segmentation, motion planning) and 1 physical robot experiment (scan‑pick‑and‑place). Running on the same hardware specification, FogROS2‑FT achieves motion planning with up to 2.2x cost reduction and up to a 5.53x reduction in 99th‑percentile (P99) long‑tail latency. FogROS2‑FT reduces the P99 long‑tail latency of object detection and semantic segmentation by 2.0x and 2.1x, respectively, under network slowdown and resource contention.
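
The replicate‑and‑take‑the‑first‑response pattern described in this abstract can be illustrated with a small asyncio sketch. It is not FogROS2‑FT code and uses no ROS 2 API: `call_replica` is a hypothetical stand‑in for a stateless cloud service, and the replica names, delays, and failure flag are made up for the demonstration.

```python
import asyncio

async def call_replica(name, delay, fail=False):
    """Hypothetical stand-in for one stateless service replica."""
    await asyncio.sleep(delay)            # simulated network + compute latency
    if fail:
        raise ConnectionError(f"{name} unavailable")
    return f"result from {name}"

async def first_response(calls):
    """Send the same request to every replica; return the first success."""
    tasks = [asyncio.create_task(c) for c in calls]
    try:
        for finished in asyncio.as_completed(tasks):
            try:
                return await finished
            except Exception:
                continue                  # this replica failed; wait for the next one
        raise RuntimeError("all replicas failed")
    finally:
        for t in tasks:
            t.cancel()                    # stop replicas that are still running

async def main():
    return await first_response([
        call_replica("cloud-a", 0.05, fail=True),   # fastest, but down
        call_replica("cloud-b", 0.10),
        call_replica("cloud-c", 0.30),              # slow replica
    ])

winner = asyncio.run(main())
print(winner)
```

Because the failed and slow replicas never block the caller, latency follows the fastest healthy replica, which is the intuition behind using unreliable low‑cost instances.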

Abstract:
Infrared (IR) imaging is commonly used in various scenarios, including autonomous driving, fire safety and defense applications. Thus, semantic segmentation of such images is of great interest. However, this task faces several challenges, including data scarcity, differing contrast and input channel number compared to natural images, and emergence of classes not represented in databases in certain scenarios, such as defense applications. Few‑shot segmentation (FSS) provides a framework to overcome these issues by segmenting query images using a few labeled support samples. However, existing FSS models for IR images require paired visible RGB images, which is a major limitation since acquiring such paired data is difficult or impossible in some applications. In this work, we develop new strategies for FSS of IR images by using generative modeling and fusion techniques. To this end, we propose to synthesize auxiliary data to provide additional channel information to complement the limited contrast in the IR images, as well as IR data synthesis for data augmentation. Here, the former helps the FSS model to better capture the relationship between the support and query sets, while the latter addresses the issue of data scarcity. Finally, to further improve the former aspect, we propose a novel fusion ensemble module for integrating the two different modalities. Our methods are evaluated on different IR datasets, and improve upon the state‑of‑the‑art (SOTA) FSS models.

Abstract:
In the evolving landscape of video enhancement and editing methodologies, a majority of deep learning techniques rely on extensive datasets of observed input and ground truth sequence pairs for optimal performance. Such reliance often falters when acquiring data becomes challenging, especially in tasks like video dehazing and relighting, where replicating identical motions and camera angles in both corrupted and ground truth sequences is complicated. Moreover, these conventional methodologies perform best when the test distribution closely mirrors the training distribution. Recognizing these challenges, this paper introduces a novel video decomposition prior ("VDP") framework which derives inspiration from professional video editing practices. Our methodology does not mandate task‑specific external data corpus collection and instead relies on the motion and appearance of the input video. The VDP framework decomposes a video sequence into a set of multiple RGB layers and associated opacity levels. These layers are then manipulated individually to obtain the desired results. We address tasks such as video object segmentation, dehazing, and relighting. Moreover, we introduce a novel logarithmic video decomposition formulation for video relighting tasks, setting a new benchmark over the existing methodologies. We observe the property of relighting emerge as we optimize for our novel relighting decomposition formulation. We evaluate our approach on standard video datasets like DAVIS, REVIDE, and SDSD and show qualitative results on a diverse array of internet videos. Project page with video results: https://www.cs.umd.edu/~gauravsh/video_decomposition/index.html
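
To make the layer‑plus‑opacity idea concrete, the sketch below recomposes a frame from RGB layers and per‑pixel opacities, edits one layer, and composites again. The abstract does not state the compositing equation, so normalized alpha blending is assumed here; `composite` and the toy layers are hypothetical and not taken from the VDP code.

```python
import numpy as np

def composite(layers, alphas):
    """layers: (N, H, W, 3) RGB values in [0, 1].
    alphas: (N, H, W) non-negative opacities.
    Returns the (H, W, 3) frame as an opacity-weighted sum of the layers
    (assumed blending rule: opacities are normalized per pixel)."""
    weights = alphas / np.clip(alphas.sum(axis=0, keepdims=True), 1e-8, None)
    return (weights[..., None] * layers).sum(axis=0)

h, w = 2, 2
background = np.full((h, w, 3), 0.2)            # dim gray layer
foreground = np.zeros((h, w, 3))
foreground[..., 0] = 1.0                        # pure red layer

alpha_fg = np.zeros((h, w))
alpha_fg[0, 0] = 1.0                            # foreground covers one pixel
alpha_bg = 1.0 - alpha_fg

layers = np.stack([background, foreground])
alphas = np.stack([alpha_bg, alpha_fg])
frame = composite(layers, alphas)

# Edit only the background layer (a crude "relighting"), then recomposite.
edited_layers = np.stack([np.clip(background * 2.0, 0.0, 1.0), foreground])
edited = composite(edited_layers, alphas)
```

The foreground pixel is untouched by the background edit, which is what makes per‑layer manipulation convenient; the opacity of the foreground layer also doubles as an object mask in this toy case.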

Abstract:
Object permanence in humans is a fundamental cue that helps in understanding the persistence of objects, even when they are fully occluded in the scene. Present‑day methods in object segmentation do not account for this amodal nature of the world, and only work for segmentation of visible or modal objects. Few amodal methods exist; single‑image segmentation methods cannot handle high levels of occlusion, which are better inferred using temporal information, and multi‑frame methods have focused solely on segmenting rigid objects. To this end, we propose to tackle video amodal segmentation by formulating it as a conditional generation task, capitalizing on the foundational knowledge in video generative models. Our method is simple; we repurpose these models to condition on a sequence of modal mask frames of an object along with contextual pseudo‑depth maps, to learn which object boundaries may be occluded and should therefore be extended to hallucinate the complete extent of an object. This is followed by a content completion stage which is able to inpaint the occluded regions of an object. We benchmark our approach alongside a wide array of state‑of‑the‑art methods on four datasets and show a dramatic improvement of up to 13% for amodal segmentation in an object's occluded region.

Abstract:
Properly understanding the performances of classifiers is essential in various scenarios. However, the literature often relies only on one or two standard scores to compare classifiers, which fails to capture the nuances of application‑specific requirements. The Tile is a recently introduced visualization tool organizing an infinity of ranking scores into a 2D map. Thanks to the Tile, it is now possible to compare classifiers efficiently, displaying all possible application‑specific preferences instead of having to rely on a pair of scores. This hitchhiker's guide to understanding the performances of two‑class classifiers presents four scenarios showcasing different user profiles: a theoretical analyst, a method designer, a benchmarker, and an application developer. We introduce several interpretative flavors adapted to the user's needs by mapping different values on the Tile. We illustrate this guide by ranking and analyzing the performances of 74 state‑of‑the‑art semantic segmentation models through the perspective of the four scenarios. Through these user profiles, we demonstrate that the Tile effectively captures the behavior of classifiers in a single visualization, while accommodating an infinite number of ranking scores. Code for mapping the different Tile flavors is available in supplementary material.

Abstract:
The recent Segment Anything Model (SAM) represents a significant breakthrough in scaling segmentation models, delivering strong performance across various downstream applications in the RGB modality. However, directly applying SAM to emerging visual modalities, such as depth and event data, results in suboptimal performance in multi‑modal segmentation tasks. In this paper, we make the first attempt to adapt SAM for multi‑modal semantic segmentation by proposing a Mixture of Low‑Rank Adaptation Experts (MoE‑LoRA) tailored for different input visual modalities. By training only the MoE‑LoRA layers while keeping SAM's weights frozen, SAM's strong generalization and segmentation capabilities can be preserved for downstream tasks. Specifically, to address cross‑modal inconsistencies, we propose a novel MoE routing strategy that adaptively generates weighted features across modalities, enhancing multi‑modal feature integration. Additionally, we incorporate multi‑scale feature extraction and fusion by adapting SAM's segmentation head and introducing an auxiliary segmentation head to effectively combine multi‑scale features for improved segmentation performance. Extensive experiments were conducted on three multi‑modal benchmarks: DELIVER, MUSES, and MCubeS. The results consistently demonstrate that the proposed method significantly outperforms state‑of‑the‑art approaches across diverse scenarios. Notably, under the particularly challenging condition of missing modalities, our approach exhibits a substantial performance gain, achieving an improvement of 32.15% compared to existing methods.
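
A mixture of low‑rank adapters on top of a frozen layer can be sketched in a few lines of NumPy. This is a generic illustration under stated assumptions, not the paper's architecture: the abstract does not describe the routing strategy, so a plain softmax router is assumed, and all shapes, names (`W_frozen`, `W_router`, `moe_lora_forward`), and initializations are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, n_experts = 8, 8, 2, 3

W_frozen = rng.normal(size=(d_out, d_in))            # pretrained weight, never updated
A = rng.normal(size=(n_experts, rank, d_in)) * 0.1   # trainable down-projections
B = np.zeros((n_experts, d_out, rank))               # trainable up-projections (zero init)
W_router = rng.normal(size=(n_experts, d_in))        # trainable router (assumed softmax gating)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_lora_forward(x):
    """x: (batch, d_in). Returns (output of shape (batch, d_out), gate weights)."""
    base = x @ W_frozen.T                             # frozen path
    gates = softmax(x @ W_router.T)                   # (batch, n_experts)
    low = np.einsum("erd,bd->ber", A, x)              # A_e x   -> (batch, experts, rank)
    delta = np.einsum("eor,ber->beo", B, low)         # B_e A_e x -> (batch, experts, d_out)
    return base + (gates[..., None] * delta).sum(axis=1), gates

x = rng.normal(size=(4, d_in))
y, gates = moe_lora_forward(x)
```

With the up‑projections initialized to zero, the adapted layer starts out identical to the frozen one, so training only `A`, `B`, and the router departs gradually from the pretrained behaviour.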

Abstract:
Document comparison typically relies on optical character recognition (OCR) as its core technology. However, OCR requires the selection of appropriate language models for each document, and the performance of multilingual or hybrid models remains limited. To overcome these challenges, we propose text change detection (TCD) using an image comparison model tailored for multilingual documents. Unlike OCR‑based approaches, our method employs word‑level text image‑to‑image comparison to detect changes. Our model generates bidirectional change segmentation maps between the source and target documents. To enhance performance without requiring explicit text alignment or scaling preprocessing, we employ correlations among multi‑scale attention features. We also construct a benchmark dataset comprising actual printed and scanned word pairs in various languages to evaluate our model. We validate our approach using our benchmark dataset and two public benchmarks, Distorted Document Images and the LRDE Document Binarization Dataset. We compare our model against state‑of‑the‑art semantic segmentation and change detection models, as well as conventional OCR‑based models.

Abstract:
Crowdsourcing provides a flexible approach for leveraging human intelligence to solve large‑scale problems, gaining widespread acceptance in domains like intelligent information processing, social decision‑making, and crowd ideation. However, the uncertainty of participants significantly compromises the answer quality, sparking substantial research interest. Existing surveys predominantly concentrate on quality control in Boolean tasks, which are generally formulated as simple label classification, ranking, or numerical prediction. Ubiquitous open‑ended tasks like question‑answering, translation, and semantic segmentation have not been sufficiently discussed. These tasks usually have large to infinite answer spaces and non‑unique acceptable answers, posing significant challenges for quality assurance. This survey focuses on quality control methods applicable to open‑ended tasks in crowdsourcing. We propose a two‑tiered framework to categorize related works. The first tier introduces a holistic view of the quality model, encompassing key aspects like task, worker, answer, and system. The second tier refines the classification into more detailed categories, including quality dimensions, evaluation metrics, and design decisions, providing insights into the internal structures of the quality control framework in each aspect. We thoroughly investigate how these quality control methods are implemented in state‑of‑the‑art works and discuss key challenges and potential future research directions.

Abstract:
Image Compression for Machines (ICM) aims to compress images for machine vision tasks rather than human viewing. Current works predominantly concentrate on high‑level tasks like object detection and semantic segmentation. However, the quality of original images is usually not guaranteed in the real world, leading to even worse perceptual quality or downstream task performance after compression. Low‑level (LL) machine vision models, like image restoration models, can help improve such quality, and thereby their compression requirements should also be considered. In this paper, we propose a pioneering ICM framework for LL machine vision tasks, namely LL‑ICM. By jointly optimizing compression and LL tasks, the proposed LL‑ICM not only enriches its encoding ability in generalizing to versatile LL tasks but also optimizes the processing ability of downstream LL task models, achieving mutual adaptation for image codecs and LL task models. Furthermore, we integrate large‑scale vision‑language models into the LL‑ICM framework to generate more universal and distortion‑robust feature embeddings for LL vision tasks. Therefore, one LL‑ICM codec can generalize to multiple tasks. We establish a solid benchmark to evaluate LL‑ICM, which includes extensive objective experiments using both full‑reference and no‑reference image quality assessments. Experimental results show that LL‑ICM can achieve 22.65% BD‑rate reductions over the state‑of‑the‑art methods.

Abstract:
Machine learning‑based embedded systems employed in safety‑critical applications such as aerospace and autonomous driving need to be robust against perturbations produced by soft errors. Soft errors are an increasing concern in modern digital processors since smaller transistor geometries and lower voltages give electronic devices a higher sensitivity to background radiation. The resilience of deep neural network (DNN) models to perturbations in their parameters is determined, to a large extent, by the structure of the model itself, and also by the selected numerical representation and the arithmetic precision used. When compression techniques such as model pruning and model quantization are applied to reduce memory footprint and computational complexity for deployment, both model structure and numerical representation are modified and thus, soft error robustness also changes. In this sense, although the choice of activation functions (AFs) in DNN models is frequently ignored, it conditions not only their accuracy and trainability, but also compressibility rates and numerical robustness. This paper investigates the suitability of using bounded AFs to improve model robustness against DNN parameter perturbations, assessing at the same time the impact of this choice on deployment in terms of model accuracy, compressibility, and computational burden. In particular, we analyze encoder‑decoder fully convolutional models aimed at performing semantic segmentation tasks on hyperspectral images for scene understanding in autonomous driving. Deployment characterization is performed experimentally on an AMD‑Xilinx KV260 SoM.
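
The intuition for why a bounded activation can limit the damage of a soft error is easy to show on a single float32 weight. The sketch below is a toy illustration and not the paper's experimental setup: a clipped ReLU with a cap of 6 is assumed as the bounded AF, and `flip_bit`, the weight value, and the input are hypothetical.

```python
import struct

def flip_bit(value, bit):
    """Flip one bit of the IEEE 754 float32 encoding of `value`
    (bit 0 is the least significant mantissa bit, bit 31 the sign)."""
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))
    return flipped

def relu(x):
    return max(0.0, x)

def clipped_relu(x, cap=6.0):
    """Bounded activation assumed for this sketch (ReLU clipped at `cap`)."""
    return min(max(0.0, x), cap)

weight, activation_in = 0.5, 2.0
corrupted = flip_bit(weight, 30)          # flip the top exponent bit

clean_pre = weight * activation_in        # 1.0
faulty_pre = corrupted * activation_in    # astronomically large

unbounded_out = relu(faulty_pre)          # error passes through unchanged
bounded_out = clipped_relu(faulty_pre)    # error is capped at 6.0
```

A single exponent bit flip turns 0.5 into 2^127; an unbounded ReLU forwards that value to later layers, while the bounded activation limits the perturbation to its cap.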

Abstract:
Esophageal cancer is among the most common types of cancer worldwide. It is traditionally treated using open esophagectomy, but in recent years, robot‑assisted minimally invasive esophagectomy (RAMIE) has emerged as a promising alternative. However, robot‑assisted surgery can be challenging for novice surgeons, as they often suffer from a loss of spatial orientation. Computer‑aided anatomy recognition holds promise for improving surgical navigation, but research in this area remains limited. In this study, we developed a comprehensive dataset for semantic segmentation in RAMIE, featuring the largest collection of vital anatomical structures and surgical instruments to date. Handling this diverse set of classes presents challenges, including class imbalance and the recognition of complex structures such as nerves. This study aims to understand the challenges and limitations of current state‑of‑the‑art algorithms on this novel dataset and problem. Therefore, we benchmarked eight real‑time deep learning models using two pretraining datasets. We assessed both traditional and attention‑based networks, hypothesizing that attention‑based networks better capture global patterns and address challenges such as occlusion caused by blood or other tissues. The benchmark includes our RAMIE dataset and the publicly available CholecSeg8k dataset, enabling a thorough assessment of surgical segmentation tasks. Our findings indicate that pretraining on ADE20k, a dataset for semantic segmentation, is more effective than pretraining on ImageNet. Furthermore, attention‑based models outperform traditional convolutional neural networks, with SegNeXt and Mask2Former achieving higher Dice scores, and Mask2Former additionally excelling in average symmetric surface distance.

Abstract:
Multi‑modal image fusion aggregates information from multiple sensor sources, achieving superior visual quality and perceptual features compared to single‑source images, often improving downstream tasks. However, current fusion methods for downstream tasks still use predefined fusion objectives that potentially mismatch the downstream tasks, limiting adaptive guidance and reducing model flexibility. To address this, we propose Task‑driven Image Fusion (TDFusion), a fusion framework incorporating a learnable fusion loss guided by task loss. Specifically, our fusion loss includes learnable parameters modeled by a neural network called the loss generation module. This module is supervised by the downstream task loss in a meta‑learning manner. The learning objective is to minimize the task loss of fused images after optimizing the fusion module with the fusion loss. Iterative updates between the fusion module and the loss module ensure that the fusion network evolves toward minimizing task loss, guiding the fusion process toward the task objectives. TDFusion's training relies entirely on the downstream task loss, making it adaptable to any specific task. It can be applied to any architecture of fusion and task networks. Experiments demonstrate TDFusion's performance through fusion experiments conducted on four different datasets, in addition to evaluations on semantic segmentation and object detection tasks.

Abstract:
Few‑shot Semantic Segmentation (FSS) aims to adapt a pre‑trained model to new classes with as few as a single labeled training sample per class. Existing prototypical methods designed for natural images focus disproportionately on capturing foreground discriminability while using a simplistic representation for the background, grounded on the inherently observed separation between foreground and background. However, this paradigm does not apply to medical images, where the foreground and background share numerous visual features, necessitating a more detailed description of the background. In this paper, we present a new pluggable Background‑fused prototype (Bro) approach for FSS in medical images. Instead of finding a commonality of background subjects in the support image, Bro incorporates this background with two pivotal designs. Specifically, Feature Similarity Calibration (FeaC) first reduces noise in the support image by employing feature cross‑attention with the query image. Subsequently, Hierarchical Channel Adversarial Attention (HiCA) merges the background into comprehensive prototypes. We achieve this with a channel‑group‑based attention mechanism, where an adversarial Mean‑Offset structure encourages a coarse‑to‑fine fusion. Extensive experiments show that previous state‑of‑the‑art methods, when paired with Bro, experience significant performance improvements. This demonstrates a more integrated way to represent backgrounds specifically for medical images.

Abstract:
Pathological cell semantic segmentation is a fundamental technology in computational pathology, essential for applications like cancer diagnosis and effective treatment. Given that multiple cell types exist across various organs, with subtle differences in cell size and shape, multi‑organ, multi‑class cell segmentation is particularly challenging. Most existing methods employ multi‑branch frameworks to enhance feature extraction, but often result in complex architectures. Moreover, reliance on visual information limits performance in multi‑class analysis due to intricate textural details. To address these challenges, we propose a Multi‑OrgaN multi‑Class cell semantic segmentation method with a single brancH (MONCH) that leverages vision‑language input. Specifically, we design a hierarchical feature extraction mechanism to provide coarse‑to‑fine‑grained features for segmenting cells of various shapes, including high‑frequency, convolutional, and topological features. Inspired by the synergy of textual and multi‑grained visual features, we introduce a progressive prompt decoder to harmonize multimodal information, integrating features from fine to coarse granularity for better context capture. Extensive experiments on the PanNuke dataset, which has significant class imbalance and subtle cell size and shape variations, demonstrate that MONCH outperforms state‑of‑the‑art cell segmentation methods and vision‑language models. Codes and implementations will be made publicly available.

Abstract:
Real‑world image super‑resolution (Real‑ISR) has achieved a remarkable leap by leveraging large‑scale text‑to‑image models, enabling realistic image restoration from given recognition textual prompts. However, these methods sometimes fail to recognize some salient objects, resulting in inaccurate semantic restoration in these regions. Additionally, the same region may have a strong response to more than one prompt and it will lead to semantic ambiguity for image super‑resolution. To alleviate the above two issues, in this paper, we propose to consider semantic segmentation as an additional control condition into diffusion‑based image super‑resolution. Compared to textual prompt conditions, semantic segmentation enables a more comprehensive perception of salient objects within an image by assigning class labels to each pixel. It also mitigates the risks of semantic ambiguities by explicitly allocating objects to their respective spatial regions. In practice, inspired by the fact that image super‑resolution and segmentation can benefit each other, we propose SegSR which introduces a dual‑diffusion framework to facilitate interaction between the image super‑resolution and segmentation diffusion models. Specifically, we develop a Dual‑Modality Bridge module to enable updated information flow between these two diffusion models, achieving mutual benefit during the reverse diffusion process. Extensive experiments show that SegSR can generate realistic images while preserving semantic structures more effectively.

Abstract:
Colorectal polyps are structural abnormalities of the gastrointestinal tract that can become cancerous in some cases. This study introduces a novel framework for colorectal polyp segmentation named the Multi‑Scale and Multi‑Path Cascaded Convolution Network (MMCC‑Net), aimed at addressing the limitations of existing models, such as inadequate spatial dependence representation and the absence of multi‑level feature integration during the decoding stage. MMCC‑Net integrates multi‑scale and multi‑path cascaded convolutional techniques and enhances feature aggregation through dual attention modules, skip connections, and a feature enhancer. MMCC‑Net achieves superior performance in identifying polyp areas at the pixel level. The proposed MMCC‑Net was tested on six public datasets and compared against eight SOTA models to demonstrate its efficiency in polyp segmentation. Across the six datasets, MMCC‑Net's Dice scores have confidence intervals ranging from (77.08, 77.56) to (94.19, 94.71), and its Mean Intersection over Union (MIoU) scores have confidence intervals ranging from (72.20, 73.00) to (89.69, 90.53). These results highlight the model's potential as a powerful tool for accurate and efficient polyp segmentation, contributing to early detection and prevention strategies in colorectal cancer.
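
For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how Dice and IoU are commonly computed for a pair of binary masks. The function name `dice_and_iou` and the empty-mask convention are illustrative assumptions, not taken from the paper, which does not describe its evaluation code.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two binary masks given as nested lists of 0/1."""
    flat_pred = [p for row in pred for p in row]
    flat_target = [t for row in target for t in row]
    if len(flat_pred) != len(flat_target):
        raise ValueError("masks must contain the same number of pixels")

    # Pixels marked as foreground in both masks.
    intersection = sum(1 for p, t in zip(flat_pred, flat_target) if p and t)
    pred_area = sum(1 for p in flat_pred if p)
    target_area = sum(1 for t in flat_target if t)
    union = pred_area + target_area - intersection

    # Assumed convention: two empty masks count as a perfect match.
    if union == 0:
        return 1.0, 1.0

    dice = 2.0 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou


if __name__ == "__main__":
    prediction = [[1, 1], [0, 0]]
    ground_truth = [[1, 0], [0, 0]]
    print(dice_and_iou(prediction, ground_truth))
```

A mean IoU over a dataset is then typically the average of the per-class or per-image IoU values; which of the two the paper uses is not stated in the abstract.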

Abstract:
Open compound domain adaptation (OCDA) is a practical domain adaptation problem that consists of a source domain, target compound domain, and unseen open domain. In this problem, the absence of domain labels and pixel‑level segmentation labels for both compound and open domains poses challenges to the direct application of existing domain adaptation and generalization methods. To address this issue, we propose Amplitude‑based curriculum learning and a Hopfield segmentation model for Open Compound Domain Adaptation (AH‑OCDA). Our method comprises two complementary components: 1) amplitude‑based curriculum learning and 2) Hopfield segmentation model. Without prior knowledge of target domains within the compound domains, amplitude‑based curriculum learning gradually induces the semantic segmentation model to adapt from the near‑source compound domain to the far‑source compound domain by ranking unlabeled compound domain images through Fast Fourier Transform (FFT). Additionally, the Hopfield segmentation model maps segmentation feature distributions from arbitrary domains to the feature distributions of the source domain. AH‑OCDA achieves state‑of‑the‑art performance on two OCDA benchmarks and extended open domains, demonstrating its adaptability to continuously changing compound domains and unseen open domains.
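
The idea of ranking unlabeled images by their Fourier amplitude can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the ranking criterion, so the sketch assumes Euclidean distance to the mean source amplitude spectrum, and it uses a naive DFT on tiny grayscale images to stay dependency-free (a practical version would call an FFT library). All function names are hypothetical.

```python
import cmath
import math


def amplitude_spectrum(image):
    """Naive 2D DFT amplitude of a small grayscale image (nested lists)."""
    h, w = len(image), len(image[0])
    amp = []
    for u in range(h):
        row = []
        for v in range(w):
            total = 0j
            for y in range(h):
                for x in range(w):
                    angle = -2.0 * math.pi * (u * y / h + v * x / w)
                    total += image[y][x] * cmath.exp(1j * angle)
            row.append(abs(total))
        amp.append(row)
    return amp


def amplitude_distance(a, b):
    """Euclidean distance between two amplitude spectra of equal size."""
    return math.sqrt(
        sum((p - q) ** 2 for ra, rb in zip(a, b) for p, q in zip(ra, rb))
    )


def rank_by_amplitude(source_images, target_images):
    """Return target indices ordered from nearest-to-source to farthest.

    Assumption: "near" means a small distance to the mean source amplitude.
    """
    h, w = len(source_images[0]), len(source_images[0][0])
    spectra = [amplitude_spectrum(img) for img in source_images]
    mean_amp = [
        [sum(s[u][v] for s in spectra) / len(spectra) for v in range(w)]
        for u in range(h)
    ]
    dists = [
        amplitude_distance(amplitude_spectrum(img), mean_amp)
        for img in target_images
    ]
    return sorted(range(len(target_images)), key=lambda i: dists[i])


if __name__ == "__main__":
    source = [[[1, 1], [1, 1]]]
    targets = [[[5, 5], [5, 5]], [[1, 1], [1, 2]]]
    print(rank_by_amplitude(source, targets))
```

A curriculum would then feed the adaptation procedure images in this order, starting with those whose amplitude statistics are closest to the source domain.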

Abstract:
Microorganism enumeration is an essential task in many applications, such as assessing contamination levels or ensuring health standards when evaluating surface cleanliness. However, it is traditionally performed by human‑supervised methods that often require manual counting, making it tedious and time‑consuming. Previous research suggests automating this task using computer vision and machine learning methods, primarily through instance segmentation or density estimation techniques. This study conducts a comparative analysis of vision transformers (ViTs) for weakly‑supervised counting in microorganism enumeration, contrasting them with traditional architectures such as ResNet and investigating ViT‑based models such as TransCrowd. We trained different versions of ViTs as the architectural backbone for feature extraction using four microbiology datasets to determine potential new approaches for total microorganism enumeration in images. Results indicate that while ResNets perform better overall, ViTs achieve competent results across all datasets, opening up promising lines of research in microorganism enumeration. This comparative study contributes to the field of microbial image analysis by presenting innovative approaches to the recurring challenge of microorganism enumeration and by highlighting the capabilities of ViTs in the task of regression counting.

Abstract:
Implicit neural representations and 3D Gaussian splatting (3DGS) have shown great potential for scene reconstruction. Recent studies have expanded their applications in autonomous reconstruction through task assignment methods. However, these methods are mainly limited to a single robot, and rapid reconstruction of large‑scale scenes remains challenging. Additionally, task‑driven planning based on surface uncertainty is prone to being trapped in local optima. To this end, we propose the first 3DGS‑based centralized multi‑robot autonomous 3D reconstruction framework. To further reduce the time cost of task generation and improve reconstruction quality, we integrate online open‑vocabulary semantic segmentation with the surface uncertainty of 3DGS, focusing view sampling on regions with high instance uncertainty. Finally, we develop a multi‑robot collaboration strategy with mode and task assignments, improving reconstruction quality while ensuring planning efficiency. Our method demonstrates the highest reconstruction quality among all planning methods and superior planning efficiency compared to existing multi‑robot methods. We deploy our method on multiple robots, and results show that it can effectively plan view paths and reconstruct scenes with high quality.

Abstract:
Robustness to out‑of‑distribution data is crucial for deploying modern neural networks. Recently, Vision Transformers, such as SegFormer for semantic segmentation, have shown impressive robustness to visual corruptions like blur or noise affecting the acquisition device. In this paper, we propose Channel Wise Feature Augmentation (CWFA), a simple yet efficient feature augmentation technique to improve the robustness of Vision Transformers for semantic segmentation. CWFA applies a globally estimated perturbation per encoder with minimal compute overhead during training. Extensive evaluations on Cityscapes and ADE20K with three state‑of‑the‑art Vision Transformer architectures (SegFormer, Swin Transformer, and Twins) demonstrate that CWFA‑enhanced models significantly improve robustness without affecting clean data performance. For instance, on Cityscapes, a CWFA‑augmented SegFormer‑B1 model yields up to 27.7% mIoU robustness gain on impulse noise compared to the non‑augmented SegFormer‑B1. Furthermore, CWFA‑augmented SegFormer‑B5 achieves a new state‑of‑the‑art 84.3% retention rate, a 0.7% improvement over the recently published FAN+STL.

Abstract:
Recent advancements in deep learning‑based compression techniques have surpassed traditional methods. However, deep neural networks remain vulnerable to backdoor attacks, where pre‑defined triggers induce malicious behaviors. This paper introduces a novel frequency‑based trigger injection model for launching backdoor attacks with multiple triggers on learned image compression models. Inspired by the widely used DCT in compression codecs, triggers are embedded in the DCT domain. We design attack objectives tailored to diverse scenarios, including: 1) degrading compression quality in terms of bit‑rate and reconstruction accuracy; 2) targeting task‑driven measures like face recognition and semantic segmentation. To improve training efficiency, we propose a dynamic loss function that balances loss terms with fewer hyper‑parameters, optimizing attack objectives effectively. For advanced scenarios, we evaluate the attack's resistance to defensive preprocessing and propose a two‑stage training schedule with robust frequency selection to enhance resilience. To improve cross‑model and cross‑domain transferability for downstream tasks, we adjust the classification boundary in the attack loss during training. Experiments show that our trigger injection models, combined with minor modifications to encoder parameters, successfully inject multiple backdoors and their triggers into a single compression model, demonstrating strong performance and versatility. (Due to the notification of arXiv "The Abstract field cannot be longer than 1,920 characters", the appeared Abstract is shortened. For the full Abstract, please download the Article.)

Abstract:
Spatial understanding of the semantics of the surroundings is a key capability needed by autonomous cars to enable safe driving decisions. Recently, purely vision‑based solutions have gained increasing research interest. In particular, approaches extracting a bird's eye view (BEV) from multiple cameras have demonstrated great performance for spatial understanding. This paper addresses the dependency on learned positional encodings to correlate image and BEV feature map elements for transformer‑based methods. We propose leveraging epipolar geometric constraints to model the relationship between cameras and the BEV by Epipolar Attention Fields. They are incorporated into the attention mechanism as a novel attribution term, serving as an alternative to learned positional encodings. Experiments show that our method EAFormer outperforms previous BEV approaches by 2% mIoU for map semantic segmentation and exhibits superior generalization capabilities compared to implicitly learning the camera configuration.

Abstract:
As large‑scale foundation models trained on billions of image‑mask pairs covering a vast diversity of scenes, objects, and contexts, SAM and its upgraded version, SAM 2, have significantly influenced multiple fields within computer vision. Leveraging such unprecedented data diversity, they exhibit strong open‑world segmentation capabilities, with SAM 2 further enhancing these capabilities to support high‑quality video segmentation. While SAMs (SAM and SAM 2) have demonstrated excellent performance in segmenting context‑independent concepts like people, cars, and roads, they overlook more challenging context‑dependent (CD) concepts, such as visual saliency, camouflage, industrial defects, and medical lesions. CD concepts rely heavily on global and local contextual information, making them susceptible to shifts in different contexts, which requires strong discriminative capabilities from the model. The lack of comprehensive evaluation of SAMs limits understanding of their performance boundaries, which may hinder the design of future models. In this paper, we conduct a thorough evaluation of SAMs on 11 CD concepts across 2D and 3D images and videos in various visual modalities within natural, medical, and industrial scenes. We develop a unified evaluation framework for SAM and SAM 2 that supports manual, automatic, and intermediate self‑prompting, aided by our specific prompt generation and interaction strategies. We further explore the potential of SAM 2 for in‑context learning and introduce prompt robustness testing to simulate real‑world imperfect prompts. Finally, we analyze the benefits and limitations of SAMs in understanding CD concepts and discuss their future development in segmentation tasks.

Abstract:
The adoption of Vision Transformers (ViTs) in resource‑constrained applications necessitates improvements in inference throughput. To this end, several token pruning and merging approaches have been proposed that improve efficiency by successively reducing the number of tokens. However, it remains an open problem to design a token reduction method that is fast, maintains high performance, and is applicable to various vision tasks. In this work, we present a token pruner that uses auxiliary prediction heads that learn to select tokens end‑to‑end based on task relevance. These auxiliary heads can be removed after training, leading to throughput close to that of a random pruner. We evaluate our method on image classification, semantic segmentation, object detection, and instance segmentation, and show speedups of 1.5 to 4x with small drops in performance. As a best case, on the ADE20k semantic segmentation benchmark, we observe a 2x speedup relative to the no‑pruning baseline, with a negligible performance penalty of 0.1 median mIoU across 5 seeds.
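
The generic mechanism behind score-based token pruning can be sketched in a few lines. This is a simplified illustration, not the paper's method: here the relevance scores are passed in as plain numbers, whereas the abstract describes learning the selection end-to-end with auxiliary heads. The function name `prune_tokens` and the `keep_ratio` parameter are hypothetical.

```python
def prune_tokens(tokens, scores, keep_ratio):
    """Keep the highest-scoring tokens, preserving their original order.

    tokens: list of token representations (any objects).
    scores: one relevance score per token (higher means more relevant).
    keep_ratio: fraction of tokens to keep, in (0, 1].
    Returns (kept_tokens, kept_indices).
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")

    # Round down, but always keep at least one token.
    num_keep = max(1, int(len(tokens) * keep_ratio))

    # Rank indices by descending score, then restore sequence order so that
    # positional structure among the surviving tokens is unchanged.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:num_keep])
    return [tokens[i] for i in kept], kept


if __name__ == "__main__":
    toks = ["t0", "t1", "t2", "t3"]
    rel = [0.1, 0.9, 0.3, 0.8]
    print(prune_tokens(toks, rel, keep_ratio=0.5))
```

Because later transformer layers then process fewer tokens, the attention and MLP cost drops roughly in proportion to the number of tokens removed.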

Abstract:
In medical imaging, efficient segmentation of colon polyps plays a pivotal role in minimally invasive solutions for colorectal cancer. This study introduces a novel approach employing two parallel encoder branches within a network for polyp segmentation. One encoder branch incorporates dual convolution blocks that can maintain feature information over increased depths, while the other uses a single convolution block with the addition of the previous layer's features, offering diversity in feature extraction within the encoder. The two branches are combined before the transpose layers with a depth‑wise concatenation operation. Our model demonstrated superior performance, surpassing several established deep‑learning architectures on the Kvasir and CVC‑ClinicDB datasets: it achieved a Dice score of 0.919 and an mIoU of 0.866 on Kvasir, and a Dice score of 0.931 and an mIoU of 0.891 on CVC‑ClinicDB. The visual and quantitative results highlight the efficacy of our model, potentially establishing a new reference model in medical image segmentation.

Abstract:
3D point cloud segmentation has a wide range of applications in areas such as autonomous driving, augmented reality, virtual reality and digital twins. The point cloud data collected in real scenes often contain small objects and categories with small sample sizes, which are difficult to handle by existing networks. In this regard, we propose a point cloud segmentation network that fuses local attention based on density perception with global attention. The core idea is to increase the effective receptive field of each point while reducing the loss of information about small objects in dense areas. Specifically, we divide different sized windows for local areas with different densities to compute attention within the window. Furthermore, we consider each local area as an independent token for the global attention of the entire input. A category‑response loss is also proposed to balance the processing of different categories and sizes of objects. In particular, we set up an additional fully connected layer in the middle of the network for prediction of the presence of object categories, and construct a binary cross‑entropy loss to respond to the presence of categories in the scene. In experiments, our method achieves competitive results in semantic segmentation and part segmentation tasks on several publicly available datasets. Experiments on point cloud data obtained from complex real‑world scenes filled with tiny objects also validate the strong segmentation capability of our method for small objects as well as small sample categories.
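
The category-presence supervision described above can be illustrated with a short sketch: derive a per-scene multi-hot target from the point labels, then score the presence logits with binary cross-entropy. The function names, the averaging over categories, and the use of raw logits are assumptions for illustration; the abstract does not specify these details.

```python
import math


def presence_targets(point_labels, num_classes):
    """Multi-hot vector: 1.0 for every category that occurs in the scene."""
    present = set(point_labels)
    return [1.0 if c in present else 0.0 for c in range(num_classes)]


def presence_bce(logits, targets):
    """Mean binary cross-entropy over categories, computed from logits."""
    if len(logits) != len(targets):
        raise ValueError("logits and targets must have the same length")
    total = 0.0
    for z, t in zip(logits, targets):
        # Numerically stable form of -[t*log(s) + (1-t)*log(1-s)],
        # where s = sigmoid(z).
        total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


if __name__ == "__main__":
    labels = [0, 2, 2, 0, 2]          # per-point category ids in one scene
    targets = presence_targets(labels, num_classes=3)
    logits = [2.0, -1.5, 0.5]         # hypothetical output of the extra layer
    print(targets, presence_bce(logits, targets))
```

Since each category contributes equally to this loss regardless of how many points it covers, a term of this kind gives rare or small-object categories more weight than a purely per-point loss would.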

Abstract:
While 3D Gaussian Splatting enables high‑quality real‑time rendering, existing Gaussian‑based frameworks for 3D semantic segmentation still face significant challenges in boundary recognition accuracy. To address this, we propose a novel 3DGS‑based framework named GradiSeg, incorporating Identity Encoding to construct a deeper semantic understanding of scenes. Our approach introduces two key modules: Identity Gradient Guided Densification (IGD) and Local Adaptive K‑Nearest Neighbors (LA‑KNN). The IGD module supervises gradients of Identity Encoding to refine Gaussian distributions along object boundaries, aligning them closely with boundary contours. Meanwhile, the LA‑KNN module employs position gradients to adaptively establish locality‑aware propagation of Identity Encodings, preventing irregular Gaussian spreads near boundaries. We validate the effectiveness of our method through comprehensive experiments. Results show that GradiSeg effectively addresses boundary‑related issues, significantly improving segmentation accuracy without compromising scene reconstruction quality. Furthermore, our method's robust segmentation capability and decoupled Identity Encoding representation make it highly suitable for various downstream scene editing tasks, including 3D object removal, swapping and so on.

Abstract:
It is widely agreed that open‑vocabulary‑based approaches outperform classical closed‑set training solutions for recognizing unseen objects in images for semantic segmentation. Existing open‑vocabulary approaches leverage vision‑language models, such as CLIP, to align visual features with rich semantic features acquired through pre‑training on large‑scale vision‑language datasets. However, the text prompts employed in these methods are short phrases based on fixed templates, failing to capture comprehensive object attributes. Moreover, while the CLIP model excels at exploiting image‑level features, it is less effective at pixel‑level representation, which is crucial for semantic segmentation tasks. In this work, we propose to alleviate the above‑mentioned issues by leveraging multiple large‑scale models to enhance the alignment between fine‑grained visual features and enriched linguistic features. Specifically, our method employs large language models (LLMs) to generate enriched language prompts with diverse visual attributes for each category, including color, shape/size, and texture/material. Additionally, for enhanced visual feature extraction, the SAM model is adopted as a supplement to the CLIP visual encoder through a proposed learnable weighted fusion strategy. Built upon these techniques, our method, termed LMSeg, achieves state‑of‑the‑art performance across all major open‑vocabulary segmentation benchmarks. The code will be made available soon.

Abstract:
Information retrieval techniques have demonstrated exceptional capabilities in identifying semantic similarities across diverse domains through robust feature representations. However, their potential in guiding synthesis tasks, particularly cross‑view image synthesis, remains underexplored. Cross‑view image synthesis presents significant challenges in establishing reliable correspondences between drastically different viewpoints. To address this, we propose a novel retrieval‑guided framework that reimagines how retrieval techniques can facilitate effective cross‑view image synthesis. Unlike existing methods that rely on auxiliary information, such as semantic segmentation maps or preprocessing modules, our retrieval‑guided framework captures semantic similarities across different viewpoints, trained through contrastive learning to create a smooth embedding space. Furthermore, a novel fusion mechanism leverages these embeddings to guide image synthesis while learning and encoding both view‑invariant and view‑specific features. To further advance this area, we introduce VIGOR‑GEN, a new urban‑focused dataset with complex viewpoint variations in real‑world scenarios. Extensive experiments demonstrate that our retrieval‑guided approach significantly outperforms existing methods on the CVUSA, CVACT and VIGOR‑GEN datasets, particularly in retrieval accuracy (R@1) and synthesis quality (FID). Our work bridges information retrieval and synthesis tasks, offering insights into how retrieval techniques can address complex cross‑domain synthesis challenges.

Abstract:
We present Track Anything Behind Everything (TABE), a novel pipeline for zero‑shot amodal video object segmentation. Unlike existing methods that require pretrained class labels, our approach uses a single query mask from the first frame where the object is visible, enabling flexible, zero‑shot inference. We pose amodal segmentation as generative outpainting from modal (visible) masks using a pretrained video diffusion model. We do not need to re‑train the diffusion model to accommodate additional input channels but instead use a pretrained model that we fine‑tune at test‑time to allow specialisation towards the tracked object. Our TABE pipeline is specifically designed to handle amodal completion, even in scenarios where objects are completely occluded. Our model and code will all be released.

Abstract:
The rapid development of Large Multimodal Models (LMMs) has significantly advanced multimodal understanding by harnessing the language abilities of Large Language Models (LLMs) and integrating modality‑specific encoders. However, LMMs are plagued by hallucinations that limit their reliability and adoption. While traditional methods to detect and mitigate these hallucinations often involve costly training or rely heavily on external models, recent approaches utilizing internal model features present a promising alternative. In this paper, we critically assess the limitations of the state‑of‑the‑art training‑free technique, the logit lens, in handling generalized visual hallucinations. We introduce ContextualLens, a refined method that leverages contextual token embeddings from middle layers of LMMs. This approach significantly improves hallucination detection and grounding across diverse categories, including actions and OCR, while also excelling in tasks requiring contextual understanding, such as spatial relations and attribute comparison. Our novel grounding technique yields highly precise bounding boxes, facilitating a transition from Zero‑Shot Object Segmentation to Grounded Visual Question Answering. Our contributions pave the way for more reliable and interpretable multimodal models.

Abstract:
Moving object detection and segmentation from a single moving camera is a challenging task, requiring an understanding of recognition, motion and 3D geometry. Combining both recognition and reconstruction boils down to a fusion problem, where appearance and motion features need to be combined for classification and segmentation. In this paper, we present a novel fusion architecture for monocular motion segmentation, M3Former, which leverages the strong performance of transformers for segmentation and multi‑modal fusion. As reconstructing motion from monocular video is ill‑posed, we systematically analyze different 2D and 3D motion representations for this problem and their importance for segmentation performance. Finally, we analyze the effect of training data and show that diverse datasets are required to achieve state‑of‑the‑art performance on KITTI and DAVIS.

Abstract:
Active research is currently underway to enhance the efficiency of vision transformers (ViTs). Most studies have focused solely on effective token mixers, overlooking the potential relationship with normalization. To boost diverse feature learning, we propose two components: a normalization module called multi‑view normalization (MVN) and a token mixer called multi‑view token mixer (MVTM). The MVN integrates three differently normalized features via batch, layer, and instance normalization using a learnable weighted sum. Each normalization method outputs a different distribution, generating distinct features. Thus, the MVN is expected to offer diverse pattern information to the token mixer, resulting in beneficial synergy. The MVTM is a convolution‑based multiscale token mixer with local, intermediate, and global filters, and it incorporates stage specificity by configuring various receptive fields for the token mixer at each stage, efficiently capturing ranges of visual patterns. We propose a novel ViT model, multi‑vision transformer (MVFormer), adopting the MVN and MVTM in the MetaFormer block, the generalized ViT scheme. Our MVFormer outperforms state‑of‑the‑art convolution‑based ViTs on image classification, object detection, and instance and semantic segmentation with the same or lower parameters and MACs. Particularly, MVFormer variants, MVFormer‑T, S, and B achieve 83.4%, 84.3%, and 84.6% top‑1 accuracy, respectively, on ImageNet‑1K benchmark.
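
The idea of combining batch, layer, and instance normalization through a weighted sum can be sketched as below. This is a hedged illustration only: the normalization axes follow the usual definitions for a batch × channel × token layout, the three mixing weights are passed in as fixed numbers rather than learned, and affine parameters are omitted. None of these details are confirmed by the abstract, and `multi_view_norm` is a hypothetical name.

```python
import math


def _normalize(values, eps=1e-5):
    """Zero-mean, unit-variance normalization of a flat list."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def multi_view_norm(x, weights, eps=1e-5):
    """Weighted sum of three normalized views of x[n][c][l].

    x: nested lists indexed by batch n, channel c, token l.
    weights: (w_batch, w_layer, w_instance); learnable in a real model.
    """
    n_size, c_size, l_size = len(x), len(x[0]), len(x[0][0])
    w_batch, w_layer, w_inst = weights
    out = [[[0.0] * l_size for _ in range(c_size)] for _ in range(n_size)]

    # Batch-norm view: statistics per channel over batch and tokens.
    for c in range(c_size):
        flat = _normalize(
            [x[n][c][l] for n in range(n_size) for l in range(l_size)], eps
        )
        for n in range(n_size):
            for l in range(l_size):
                out[n][c][l] += w_batch * flat[n * l_size + l]

    # Layer-norm view: statistics per token over channels.
    for n in range(n_size):
        for l in range(l_size):
            col = _normalize([x[n][c][l] for c in range(c_size)], eps)
            for c in range(c_size):
                out[n][c][l] += w_layer * col[c]

    # Instance-norm view: statistics per sample and channel over tokens.
    for n in range(n_size):
        for c in range(c_size):
            row = _normalize(x[n][c], eps)
            for l in range(l_size):
                out[n][c][l] += w_inst * row[l]

    return out


if __name__ == "__main__":
    feats = [[[1.0, 2.0, 3.0], [4.0, 6.0, 8.0]],
             [[0.0, 1.0, 5.0], [2.0, 2.5, 9.0]]]
    print(multi_view_norm(feats, weights=(0.5, 0.3, 0.2)))
```

Each view centers and scales the features over a different set of axes, so the three terms generally differ, which is the diversity the abstract refers to.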

Abstract:
Segment Anything Model 2 (SAM 2) has emerged as a powerful tool for video object segmentation and tracking anything. Key components of SAM 2 that drive the impressive video object segmentation performance include a large multistage image encoder for frame feature extraction and a memory mechanism that stores memory contexts from past frames to help current frame segmentation. The high computational complexity of the multistage image encoder and memory module has limited its applications in real‑world tasks, e.g., video object segmentation on mobile devices. To address this limitation, we propose EfficientTAMs, lightweight track anything models that produce high‑quality results with low latency and model size. Our idea is based on revisiting the plain, nonhierarchical Vision Transformer (ViT) as an image encoder for video object segmentation, and introducing an efficient memory module, which reduces the complexity for both frame feature extraction and memory computation for current frame segmentation. We use vanilla lightweight ViTs and an efficient memory module to build EfficientTAMs, and train the models on SA‑1B and SA‑V datasets for video object segmentation and track anything tasks. We evaluate on multiple video segmentation benchmarks including semi‑supervised VOS and promptable video segmentation, and find that our proposed EfficientTAM with vanilla ViT performs comparably to the SAM 2 model (HieraB+SAM 2) with ~2x speedup on A100 and ~2.4x parameter reduction. On segment anything image tasks, our EfficientTAMs also perform favorably over original SAM with ~20x speedup on A100 and ~20x parameter reduction. On mobile devices such as iPhone 15 Pro Max, our EfficientTAMs can run at ~10 FPS for performing video object segmentation with reasonable quality, highlighting the capability of small models for on‑device video object segmentation applications.

Abstract:
Creating as‑is models from scratch remains a time‑ and money‑consuming task because of the high manual effort involved. Therefore, projects, especially those with a big spatial extent, could profit from automating the process of creating semantically rich 3D geometries from surveying data such as Point Cloud Data (PCD). Automation can be achieved by using Machine and Deep Learning Models for object recognition and semantic segmentation of PCD. Because PCD usually includes no more than the position and RGB colour values of points, semantically enriched Geoinformation System (GIS) data can be used to enhance the process of creating meaningful as‑is models. This paper presents a methodology, an implementation framework and a proof of concept for the automated generation of GIS‑informed and BIM‑ready as‑is Building Information Models (BIM) for railway projects. The results show a high potential for cost savings and reveal the untapped potential of freely accessible GIS data.

Abstract:
Supervised deep learning requires massive labeled datasets, but obtaining annotations is not always easy or possible, especially for dense tasks like semantic segmentation. To overcome this issue, numerous works explore Unsupervised Domain Adaptation (UDA), which uses a labeled dataset from another domain (source), or Semi‑Supervised Learning (SSL), which trains on a partially labeled set. Despite the success of UDA and SSL, reaching supervised performance at a low annotation cost remains a notoriously elusive goal. To address this, we study the promising setting of Semi‑Supervised Domain Adaptation (SSDA). We propose a simple SSDA framework that combines consistency regularization, pixel contrastive learning, and self‑training to effectively utilize a few target‑domain labels. Our method outperforms prior art in the popular GTA‑to‑Cityscapes benchmark and shows that as little as 50 target labels can suffice to achieve near‑supervised performance. Additional results on Synthia‑to‑Cityscapes, GTA‑to‑BDD and Synthia‑to‑BDD further demonstrate the effectiveness and practical utility of the method. Lastly, we find that existing UDA and SSL methods are not well‑suited for the SSDA setting and discuss design patterns to adapt them.

Abstract:
There has been extensive progress in the reconstruction and generation of 4D scenes from monocular casually‑captured video. While these tasks rely heavily on known camera poses, the problem of finding such poses using structure‑from‑motion (SfM) often depends on robustly separating static from dynamic parts of a video. The lack of a robust solution to this problem limits the performance of SfM camera‑calibration pipelines. We propose a novel approach to video‑based motion segmentation to identify the components of a scene that are moving w.r.t. a fixed world frame. Our simple but effective iterative method, RoMo, combines optical flow and epipolar cues with a pre‑trained video segmentation model. It outperforms unsupervised baselines for motion segmentation as well as supervised baselines trained from synthetic data. More importantly, the combination of an off‑the‑shelf SfM pipeline with our segmentation masks establishes a new state‑of‑the‑art on camera calibration for scenes with dynamic content, outperforming existing methods by a substantial margin.
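
The epipolar cue mentioned above can be illustrated with the Sampson distance, a standard first-order measure of how far a point correspondence lies from the epipolar constraint. This is a generic sketch, not the RoMo method itself: the fundamental matrix `F` is assumed to be given, and the function names and the threshold are hypothetical.

```python
def sampson_distance(F, p1, p2):
    """Sampson distance of the correspondence p1 -> p2 under fundamental matrix F.

    F is a 3x3 nested list; p1 and p2 are (x, y) pixel coordinates in two frames.
    """
    x1 = (p1[0], p1[1], 1.0)
    x2 = (p2[0], p2[1], 1.0)
    # F @ x1 is the epipolar line in frame 2; F^T @ x2 is the line in frame 1.
    f_x1 = [sum(F[r][k] * x1[k] for k in range(3)) for r in range(3)]
    ft_x2 = [sum(F[k][r] * x2[k] for k in range(3)) for r in range(3)]
    numerator = sum(x2[r] * f_x1[r] for r in range(3)) ** 2
    denominator = f_x1[0] ** 2 + f_x1[1] ** 2 + ft_x2[0] ** 2 + ft_x2[1] ** 2
    return numerator / denominator if denominator > 0 else 0.0


def flag_moving(F, matches, threshold=1.0):
    """Mark each match as likely dynamic when it violates the epipolar constraint."""
    return [sampson_distance(F, a, b) > threshold for a, b in matches]


# Pure sideways camera translation: static points keep their y coordinate.
F_SIDEWAYS = [[0.0, 0.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]]
```

Under `F_SIDEWAYS`, a match that only shifts horizontally has zero distance, while a match whose vertical coordinate changes is flagged. A point moving along its own epipolar line would go undetected by this cue alone, which is one reason to combine it with other signals such as a segmentation model.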

Abstract:
In autonomous driving, environment perception has significantly advanced with the utilization of deep learning techniques for diverse sensors such as cameras, depth sensors, or infrared sensors. The diversity in the sensor stack increases the safety and contributes to robustness against adverse weather and lighting conditions. However, the variance in data acquired from different sensors poses challenges. In the context of continual learning (CL), incremental learning is especially challenging for considerably large domain shifts, e.g. different sensor modalities. This amplifies the problem of catastrophic forgetting. To address this issue, we formulate the concept of modality‑incremental learning and examine its necessity, by contrasting it with existing incremental learning paradigms. We propose the use of a modified Relevance Mapping Network (RMN) to incrementally learn new modalities while preserving performance on previously learned modalities, in which relevance maps are disjoint. Experimental results demonstrate that the prevention of shared connections in this approach helps alleviate the problem of forgetting within the constraints of a strict continual learning framework.

Abstract:
This paper aims to address universal segmentation for image and video perception with the strong reasoning ability empowered by Visual Large Language Models (VLLMs). Despite significant progress in current unified segmentation methods, limitations in adaptation to both image and video scenarios, as well as the complex reasoning segmentation, make it difficult for them to handle various challenging instructions and achieve an accurate understanding of fine‑grained vision‑language correlations. We propose HyperSeg, the first VLLM‑based universal segmentation model for pixel‑level image and video perception, encompassing generic segmentation tasks and more complex reasoning perception tasks requiring powerful reasoning abilities and world knowledge. Besides, to fully leverage the recognition capabilities of VLLMs and the fine‑grained visual information, HyperSeg incorporates hybrid entity recognition and fine‑grained visual perceiver modules for various segmentation tasks. Combined with the temporal adapter, HyperSeg achieves a comprehensive understanding of temporal information. Experimental results validate the effectiveness of our insights in resolving universal image and video segmentation tasks, including the more complex reasoning perception tasks. Our code is available.

Abstract:
Caenorhabditis elegans (C. elegans) is an excellent model organism because of its short lifespan and high degree of homology with human genes, and it has been widely used in a variety of human health and disease models. However, the segmentation of C. elegans remains challenging due to the following reasons: 1) the activity trajectory of C. elegans is uncontrollable, and multiple nematodes often overlap, resulting in blurred boundaries of C. elegans. This makes it impossible to clearly study the life trajectory of a certain nematode; and 2) in the microscope images of overlapping C. elegans, the translucent tissues at the edges obscure each other, leading to inaccurate boundary segmentation. To solve these problems, a Bilayer Segmentation‑Recombination Network (BR‑Net) for the segmentation of C. elegans instances is proposed. The network consists of three parts: A Coarse Mask Segmentation Module (CMSM), a Bilayer Segmentation Module (BSM), and a Semantic Consistency Recombination Module (SCRM). The CMSM is used to extract the coarse mask, and we introduce a Unified Attention Module (UAM) in CMSM to make CMSM better aware of nematode instances. The Bilayer Segmentation Module (BSM) segments the aggregated C. elegans into overlapping and non‑overlapping regions. This is followed by integration by the SCRM, where semantic consistency regularization is introduced to segment nematode instances more accurately. Finally, the effectiveness of the method is verified on the C. elegans dataset. The experimental results show that BR‑Net exhibits good competitiveness and outperforms other recently proposed instance segmentation methods in processing C. elegans occlusion images.

Abstract:
The article discusses the use of low cost System‑On‑Module (SOM) platforms for the implementation of efficient hyperspectral imaging (HSI) processors for application in autonomous driving. The work addresses the challenges of shaping and deploying multiple layer fully convolutional networks (FCN) for low‑latency, on‑board image semantic segmentation using resource‑ and power‑constrained processing devices. The paper describes in detail the steps followed to redesign and customize a successfully trained HSI segmentation lightweight FCN that was previously tested on a high‑end heterogeneous multiprocessing system‑on‑chip (MPSoC) to accommodate it to the constraints imposed by a low‑cost SOM. This SOM features a lower‑end but much cheaper MPSoC suitable for the deployment of automatic driving systems (ADS). In particular the article reports the data‑ and hardware‑specific quantization techniques utilized to fit the FCN into a commercial fixed‑point programmable AI coprocessor IP, and proposes a full customized post‑training quantization scheme to reduce computation and storage costs without compromising segmentation accuracy.
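
As a rough illustration of what fitting weights into a fixed-point format involves, the sketch below quantizes floating-point values to signed integers with a power-of-two scale. It is a generic post-training quantization example under stated assumptions, not the scheme proposed in the article: the 8-bit width, the per-tensor scale, and the function names are all hypothetical.

```python
import math


def quantize_fixed_point(weights, bits=8):
    """Quantize floats to signed fixed-point integers with a power-of-two scale.

    Returns the integer values and the number of fractional bits chosen so that
    the largest magnitude still fits in the signed range.
    """
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(w) for w in weights)
    # Largest frac_bits such that peak * 2**frac_bits does not exceed qmax.
    frac_bits = math.floor(math.log2(qmax / peak)) if peak > 0 else 0
    scale = 2.0 ** frac_bits
    quantized = [max(-qmax - 1, min(qmax, round(w * scale))) for w in weights]
    return quantized, frac_bits


def dequantize(quantized, frac_bits):
    """Map fixed-point integers back to floats for error inspection."""
    return [q / 2.0 ** frac_bits for q in quantized]
```

A power-of-two scale is used here because rescaling then reduces to a bit shift, which suits fixed-point hardware; the rounding error per value is bounded by half a quantization step.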

Abstract:
Self‑supervised learning has emerged as a promising approach for acquiring transferable 3D representations from unlabeled 3D point clouds. Unlike 2D images, which are widely accessible, acquiring 3D assets requires specialized expertise or professional 3D scanning equipment, making it difficult to scale and raising copyright concerns. To address these challenges, we propose learning 3D representations from procedural 3D programs that automatically generate 3D shapes using simple 3D primitives and augmentations. Remarkably, despite lacking semantic content, the 3D representations learned from the procedurally generated 3D shapes perform on par with state‑of‑the‑art representations learned from semantically recognizable 3D models (e.g., airplanes) across various downstream 3D tasks, such as shape classification, part segmentation, masked point cloud completion, and both scene semantic and instance segmentation. We provide a detailed analysis of the factors that make a good 3D procedural program. Extensive experiments further suggest that current 3D self‑supervised learning methods on point clouds do not rely on semantics of 3D shapes, shedding light on the nature of 3D representations learned.

Abstract:
Tracking geographic entities from historical maps, such as buildings, offers valuable insights into cultural heritage, urbanization patterns, environmental changes, and various historical research endeavors. However, linking these entities across diverse maps remains a persistent challenge for researchers. Traditionally, this has been addressed through a two‑step process: detecting entities within individual maps and then associating them via a heuristic‑based post‑processing step. In this paper, we propose a novel approach that combines segmentation and association of geographic entities in historical maps using video instance segmentation (VIS). This method significantly streamlines geographic entity alignment and enhances automation. However, acquiring high‑quality, video‑format training data for VIS models is prohibitively expensive, especially for historical maps that often contain hundreds or thousands of geographic entities. To mitigate this challenge, we explore self‑supervised learning (SSL) techniques to enhance VIS performance on historical maps. We evaluate the performance of VIS models under different pretraining configurations and introduce a novel method for generating synthetic videos from unlabeled historical map images for pretraining. Our proposed self‑supervised VIS method substantially reduces the need for manual annotation. Experimental results demonstrate the superiority of the proposed self‑supervised VIS approach, achieving a 24.9% improvement in AP and a 0.23 increase in F1 score compared to the model trained from scratch.

Abstract:
Relic landslides, formed over a long period, possess the potential for reactivation, making them a hazardous geological phenomenon. While reliable relic landslide detection benefits the effective monitoring and prevention of landslide disasters, semantic segmentation using high‑resolution remote sensing images for relic landslides faces many challenges, including the object visual blur problem, due to changes in appearance caused by prolonged natural evolution and human activities, and the small‑sized dataset problem, due to the difficulty of recognizing and labelling the samples. To address these challenges, a semantic segmentation model, termed mask‑recovering and interactive‑feature‑enhancing (MRIFE), is proposed for more efficient feature extraction and separation. Specifically, a contrastive learning and mask reconstruction method with locally significant feature enhancement is proposed to improve the ability to distinguish between the target and background and represent landslide semantic features. Meanwhile, a dual‑branch interactive feature enhancement architecture is used to enrich the extracted features and address the issue of visual ambiguity. Self‑distillation learning is introduced to leverage the feature diversity both within and between samples for contrastive learning, improving sample utilization, accelerating model convergence, and effectively addressing the problem of the small‑sized dataset. The proposed MRIFE is evaluated on a real relic landslide dataset, and experimental results show that it greatly improves the performance of relic landslide detection. For the semantic segmentation task, compared to the baseline, the precision increases from 0.4226 to 0.5347, the mean intersection over union (IoU) increases from 0.6405 to 0.6680, the landslide IoU increases from 0.3381 to 0.3934, and the F1‑score increases from 0.5054 to 0.5646.
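
The metrics reported above follow from per-pixel confusion counts. The sketch below uses the standard definitions for a binary landslide/background task; the function name and the example counts are hypothetical and do not come from the paper.

```python
def binary_segmentation_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1, per-class IoU, and mean IoU from pixel counts.

    tp/fp/fn/tn count landslide pixels as the positive class.
    """
    def ratio(numerator, denominator):
        # Guard against empty classes instead of raising ZeroDivisionError.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    f1 = ratio(2 * tp, 2 * tp + fp + fn)
    landslide_iou = ratio(tp, tp + fp + fn)
    background_iou = ratio(tn, tn + fp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "landslide_iou": landslide_iou,
        "background_iou": background_iou,
        "mean_iou": (landslide_iou + background_iou) / 2,
    }
```

For a single class, F1 and IoU are tied by F1 = 2·IoU / (1 + IoU). The figures quoted in the abstract are consistent with this to about three decimal places: a landslide IoU of 0.3934 gives an F1 of roughly 0.5647, and 0.3381 gives roughly 0.5053.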

Abstract:
Open‑Vocabulary Semantic Segmentation (OVSS) has advanced with recent vision‑language models (VLMs), enabling segmentation beyond predefined categories through various learning schemes. Notably, training‑free methods offer scalable, easily deployable solutions for handling unseen data, a key goal of OVSS. Yet, a critical issue persists: a lack of object‑level context when segmenting complex objects in the challenging environment of OVSS based on arbitrary query prompts. This oversight limits models' ability to group semantically consistent elements within an object and map them precisely to user‑defined arbitrary classes. In this work, we introduce a novel approach that overcomes this limitation by incorporating object‑level contextual knowledge within images. Specifically, our model enhances intra‑object consistency by distilling spectral‑driven features from vision foundation models into the attention mechanism of the visual encoder, enabling semantically coherent components to form a single object mask. Additionally, we refine the text embeddings with zero‑shot object presence likelihood to ensure accurate alignment with the specific objects represented in the images. By leveraging object‑level contextual knowledge, our proposed approach achieves state‑of‑the‑art performance with strong generalizability across diverse datasets.

Abstract:
Terraced field is a significant engineering practice for soil and water conservation (SWC). Terraced field extraction from remotely sensed imagery is the foundation for monitoring and evaluating SWC. This study is the first to propose a novel dual‑modal Ω‑like super‑resolution Transformer network for intelligent TFVE, offering the following advantages: (1) reducing edge segmentation error from conventional multi‑scale downsampling encoder, through fusing original high‑resolution features with downsampling features at each step of encoder and leveraging a multi‑head attention mechanism; (2) improving the accuracy of TFVE by proposing a Ω‑like network structure, which fully integrates rich high‑level features from both spectral and terrain data to form cross‑scale super‑resolution features; (3) validating an optimal fusion scheme for cross‑modal and cross‑scale (i.e., inconsistent spatial resolution between remotely sensed imagery and DEM) super‑resolution feature extraction; (4) mitigating uncertainty between segmentation edge pixels by a coarse‑to‑fine and spatial topological semantic relationship optimization (STSRO) segmentation strategy; (5) leveraging contour vibration neural network to continuously optimize parameters and iteratively vectorize terraced fields from semantic segmentation results. Moreover, a DMRVD for deep‑learning‑based TFVE was created for the first time, which covers nine study areas in four provinces of China, with a total coverage area of 22441 square kilometers. To assess the performance of ΩSFormer, classic and SOTA networks were compared. The mIOU of ΩSFormer has improved by 0.165, 0.297 and 0.128 respectively, when compared with best accuracy single‑modal remotely sensed imagery, single‑modal DEM and dual‑modal result.

Abstract:
The Vision Transformer (ViT) has achieved notable success in computer vision, with its variants widely validated across various downstream tasks, including semantic segmentation. However, as general‑purpose visual encoders, ViT backbones often do not fully address the specific requirements of task decoders, highlighting opportunities for designing decoders optimized for efficient semantic segmentation. This paper proposes Strip Cross‑Attention (SCASeg), an innovative decoder head specifically designed for semantic segmentation. Instead of relying on the conventional skip connections, we utilize lateral connections between encoder and decoder stages, leveraging encoder features as Queries in cross‑attention modules. Additionally, we introduce a Cross‑Layer Block (CLB) that integrates hierarchical feature maps from various encoder and decoder stages to form a unified representation for Keys and Values. The CLB also incorporates the local perceptual strengths of convolution, enabling SCASeg to capture both global and local context dependencies across multiple layers, thus enhancing feature interaction at different scales and improving overall efficiency. To further optimize computational efficiency, SCASeg compresses the channels of queries and keys into one dimension, creating strip‑like patterns that reduce memory usage and increase inference speed compared to traditional vanilla cross‑attention. Experiments show that SCASeg's adaptable decoder delivers competitive performance across various setups, outperforming leading segmentation architectures on benchmark datasets, including ADE20K, Cityscapes, COCO‑Stuff 164k, and Pascal VOC2012, even under diverse computational constraints.

Abstract:
Diffusion models have recently been employed to generate high‑quality images, reducing the need for manual data collection and improving model generalization in tasks such as object detection, instance segmentation, and image perception. However, the synthetic framework is usually designed with meticulous human effort for each task due to various requirements on image layout, content, and annotation formats, restricting the application of synthetic data to more general scenarios. In this paper, we propose AnySynth, a unified framework integrating adaptable, comprehensive, and highly controllable components capable of generating an arbitrary type of synthetic data given diverse requirements. Specifically, the Task‑Specific Layout Generation Module is first introduced to produce reasonable layouts for different tasks by leveraging the generation ability of large language models and layout priors of real‑world images. A Uni‑Controlled Image Generation Module is then developed to create high‑quality synthetic images that are controllable and based on the generated layouts. In addition, user‑specific reference images and style images can be incorporated into the generation to meet task requirements. Finally, the Task‑Oriented Annotation Module offers precise and detailed annotations for the generated images across different tasks. We have validated our framework's performance across various tasks, including Few‑shot Object Detection, Cross‑domain Object Detection, Zero‑shot Composed Image Retrieval, and Multi‑modal Image Perception and Grounding. The specific data synthesized by our framework significantly improves model performance in these tasks, demonstrating the generality and effectiveness of our framework.

Abstract:
Despite the recent progress in deep learning based computer vision, domain shifts are still one of the major challenges. Semantic segmentation for autonomous driving faces a wide range of domain shifts, e.g. caused by changing weather conditions, new geolocations and the frequent use of synthetic data in model training. Unsupervised domain adaptation (UDA) methods have emerged which adapt a model to a new target domain by only using unlabeled data of that domain. The variety of UDA methods is large but all of them use ImageNet pre‑trained models. Recently, vision‑language models have demonstrated strong generalization capabilities which may facilitate domain adaptation. We show that simply replacing the encoder of existing UDA methods like DACS by a vision‑language pre‑trained encoder can result in significant performance improvements of up to 10.0% mIoU on the GTA5‑to‑Cityscapes domain shift. For the generalization performance to unseen domains, the newly employed vision‑language pre‑trained encoder provides a gain of up to 13.7% mIoU across three unseen datasets. However, we find that not all UDA methods can be easily paired with the new encoder and that the UDA performance does not always likewise transfer into generalization performance. Finally, we perform our experiments on an adverse weather condition domain shift to further verify our findings on a pure real‑to‑real domain shift.

Abstract:
Advances in architectural design, data availability, and compute have driven remarkable progress in semantic segmentation. Yet, these models often rely on relaxed Bayesian assumptions, omitting critical uncertainty information needed for robust decision‑making. Despite growing interest in probabilistic segmentation to address point‑estimate limitations, the research landscape remains fragmented. In response, this review synthesizes foundational concepts in uncertainty modeling, analyzing how feature‑ and parameter‑distribution modeling impact four key segmentation tasks: Observer Variability, Active Learning, Model Introspection, and Model Generalization. Our work establishes a common framework by standardizing theory, notation, and terminology, thereby bridging the gap between method developers, task specialists, and applied researchers. We then discuss critical challenges, including the nuanced distinction between uncertainty types, strong assumptions in spatial aggregation, the lack of standardized benchmarks, and pitfalls in current quantification methods. We identify promising avenues for future research, such as uncertainty‑aware active learning, data‑driven benchmarks, transformer‑based models, and novel techniques to move from simple segmentation problems to uncertainty in holistic scene understanding. Based on our analysis, we offer practical guidelines for researchers on method selection, evaluation, reproducibility, and meaningful uncertainty estimation. Ultimately, our goal is to facilitate the development of more reliable, efficient, and interpretable segmentation models that can be confidently deployed in real‑world applications.

Abstract:
Existing conditional Denoising Diffusion Probabilistic Models (DDPMs) with a Noise‑Conditional Framework (NCF) remain challenging for 3D scene understanding tasks, as the complex geometric details in scenes increase the difficulty of fitting the gradients of the data distribution (the scores) from semantic labels. This also results in longer training and inference time for DDPMs compared to non‑DDPMs. From a different perspective, we delve deeply into the model paradigm dominated by the Conditional Network. In this paper, we propose an end‑to‑end robust semantic Segmentation Network based on a Conditional‑Noise Framework (CNF) of DDPMs, named CDSegNet. Specifically, CDSegNet models the Noise Network (NN) as a learnable noise‑feature generator. This enables the Conditional Network (CN) to understand 3D scene semantics under multi‑level feature perturbations, enhancing the generalization in unseen scenes. Meanwhile, benefiting from the noise system of DDPMs, CDSegNet exhibits strong noise and sparsity robustness in experiments. Moreover, thanks to CNF, CDSegNet can generate the semantic labels in a single‑step inference like non‑DDPMs, due to avoiding directly fitting the scores from semantic labels in the dominant network of CDSegNet. On public indoor and outdoor benchmarks, CDSegNet significantly outperforms existing methods, achieving state‑of‑the‑art performance.

Abstract:
Autonomous driving needs good roads, but 85% of Brazilian roads have damage that deep learning models may not account for, as most semantic segmentation datasets for autonomous driving consist of high‑resolution images of well‑maintained urban roads. A representative dataset for emerging countries consists of low‑resolution images of poorly maintained roads and includes labels of damage classes; in this scenario, three challenges arise: objects with few pixels, objects with undefined shapes, and highly underrepresented classes. To tackle these challenges, this work proposes the Performance Increment Strategy for Semantic Segmentation (PISSS) as a methodology of 14 training experiments to boost performance. With PISSS, we reached state‑of‑the‑art results of 79.8 and 68.8 mIoU on the Road Traversing Knowledge (RTK) and Technik Autonomer Systeme 500 (TAS500) test sets, respectively. Furthermore, we also offer an analysis of DeepLabV3+ pitfalls for small object segmentation.

Abstract:
Existing 3D instance segmentation methods frequently encounter issues with over‑segmentation, leading to redundant and inaccurate 3D proposals that complicate downstream tasks. This challenge arises from their unsupervised merging approach, where dense 2D instance masks are lifted across frames into point clouds to form 3D candidate proposals without direct supervision. These candidates are then hierarchically merged based on heuristic criteria, often resulting in numerous redundant segments that fail to combine into precise 3D proposals. To overcome these limitations, we propose a 3D‑Aware 2D Mask Tracking module that uses robust 3D priors from a 2D mask segmentation and tracking foundation model (SAM‑2) to ensure consistent object masks across video frames. Rather than merging all visible superpoints across views to create a 3D mask, our 3D Mask Optimization module leverages a dynamic programming algorithm to select an optimal set of views, refining the superpoints to produce a final 3D proposal for each object. Our approach achieves comprehensive object coverage within the scene while reducing unnecessary proposals, which could otherwise impair downstream applications. Evaluations on ScanNet200 and ScanNet++ confirm the effectiveness of our method, with improvements across Class‑Agnostic, Open‑Vocabulary, and Open‑Ended 3D Instance Segmentation tasks.

Abstract:
This study explores the effectiveness of multi‑temporal satellite imagery for better functional field boundary delineation using a deep learning semantic segmentation architecture on two distinct geographical and multi‑scale farming systems of the Netherlands and Pakistan. Multidate images of April, August and October 2022 were acquired for PlanetScope and Sentinel‑2 in subregions of the Netherlands, and of November 2022, February and March 2023 for a selected area of Dunyapur in Pakistan. For the Netherlands, the Basic registration crop parcels (BRP) vector layer was used as labeled training data, while self‑crafted field boundary vector data were utilized for Pakistan. Four deep learning models with UNET architecture were evaluated using different combinations of multi‑date images and NDVI stacks in the Netherlands subregions. A comparative analysis of IoU scores assessed the effectiveness of the proposed multi‑date NDVI stack approach. These findings were then applied for transfer learning, using pre‑trained models from the Netherlands on the selected area in Pakistan. Additionally, separate models were trained using self‑crafted field boundary data for Pakistan, and combined models were developed using data from both the Netherlands and Pakistan. Results indicate that multi‑date NDVI stacks provide additional temporal context, reflecting crop growth over different times of the season. The study underscores the critical role of multi‑scale ground information from diverse geographical areas in developing robust and universally applicable models for field boundary delineation. The results also highlight the importance of fine spatial resolution for extraction of field boundaries in regions with small‑scale farming. The findings can be extended to multi‑scale implementations for improved automatic field boundary delineation in heterogeneous agricultural environments.
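
A multi-date NDVI stack of the kind described above can be built by computing NDVI = (NIR - Red) / (NIR + Red) per pixel for each acquisition date and using the dates as input channels. The sketch below assumes reflectance grids given as nested lists; the function names and the zero value for empty pixels are illustrative choices, not details from the study.

```python
def ndvi(nir, red):
    """Normalized difference vegetation index for one pixel."""
    total = nir + red
    # Pixels with no signal in either band are mapped to 0.0 (an assumption).
    return (nir - red) / total if total != 0 else 0.0


def ndvi_stack(acquisitions):
    """Build a [dates][rows][cols] NDVI stack.

    `acquisitions` is a list of (nir_grid, red_grid) pairs, one per date, where
    each grid is a list of rows of reflectance values on the same pixel grid.
    """
    return [
        [[ndvi(n, r) for n, r in zip(nir_row, red_row)]
         for nir_row, red_row in zip(nir_grid, red_grid)]
        for nir_grid, red_grid in acquisitions
    ]
```

Each date becomes one channel, so three acquisitions yield a three-channel input in which a pixel's values trace the crop's greenness over the season.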

Abstract:
While vision‑language models like CLIP have shown remarkable success in open‑vocabulary tasks, their application is currently confined to image‑level tasks, and they still struggle with dense predictions. Recent works often attribute such deficiency in dense predictions to the self‑attention layers in the final block, and have achieved commendable results by modifying the original query‑key attention to self‑correlation attention (e.g., query‑query and key‑key attention). However, these methods overlook the cross‑correlation attention (query‑key) properties, which capture the rich spatial correspondence. In this paper, we reveal that the cross‑correlation of the self‑attention in CLIP's non‑final layers also exhibits localization properties. Therefore, we propose the Residual Cross‑correlation Self‑attention (RCS) module, which leverages the cross‑correlation self‑attention from intermediate layers to remold the attention in the final block. The RCS module effectively reorganizes spatial information, unleashing the localization potential within CLIP for dense vision‑language inference. Furthermore, to enhance the focus on regions of the same categories and local consistency, we propose the Semantic Feedback Refinement (SFR) module, which utilizes semantic segmentation maps to further adjust the attention scores. By integrating these two strategies, our method, termed ResCLIP, can be easily incorporated into existing approaches as a plug‑and‑play module, significantly boosting their performance in dense vision‑language inference. Extensive experiments across multiple standard benchmarks demonstrate that our method surpasses state‑of‑the‑art training‑free methods, validating the effectiveness of the proposed approach. Code is available at https://github.com/yvhangyang/ResCLIP.

Abstract:
Open‑vocabulary semantic segmentation aims to assign pixel‑level labels to images across an unlimited range of classes. Traditional methods address this by sequentially connecting a powerful mask proposal generator, such as the Segment Anything Model (SAM), with a pre‑trained vision‑language model like CLIP. However, these two‑stage approaches often suffer from high computational costs and memory inefficiencies. In this paper, we propose ESC‑Net, a novel one‑stage open‑vocabulary segmentation model that leverages the SAM decoder blocks for class‑agnostic segmentation within an efficient inference framework. By embedding pseudo prompts generated from image‑text correlations into SAM's promptable segmentation framework, ESC‑Net achieves refined spatial aggregation for accurate mask predictions. ESC‑Net achieves superior performance on standard benchmarks, including ADE20K, PASCAL‑VOC, and PASCAL‑Context, outperforming prior methods in both efficiency and accuracy. Comprehensive ablation studies further demonstrate its robustness across challenging conditions.

Abstract:
Transformer‑based methods have become the dominant approach for 3D instance segmentation. These methods predict instance masks via instance queries, ranking them by classification confidence and IoU scores to select the top prediction as the final outcome. However, it has been observed that the current models employ a fixed and higher number of queries than the instances present within a scene. In such instances, multiple queries predict the same instance, yet only a single query is ultimately optimized. The close scores of queries in the lower‑level decoders make it challenging for the dominant query to distinguish itself rapidly, which ultimately impairs the model's accuracy and convergence efficiency. This phenomenon is referred to as inter‑query competition. To address this challenge, we put forth a series of plug‑and‑play competition‑oriented designs, collectively designated as the CompetitorFormer, with the aim of reducing competition and facilitating a dominant query. Experiments showed that integrating our designs with state‑of‑the‑art frameworks consistently resulted in significant performance improvements in 3D instance segmentation across a range of datasets.

Abstract:
The Segment‑Anything Model (SAM) is a vision foundation model for segmentation with a prompt‑driven framework. SAM generates class‑agnostic masks based on user‑specified instance‑referring prompts. However, adapting SAM for automated segmentation of specific object classes, where manual input is absent, often requires additional model training. We present Segment Any Class (SAC), a novel, training‑free approach that task‑adapts SAM for multi‑class segmentation. SAC generates Class‑Region Proposals (CRPs) on query images, which allows us to automatically generate class‑aware prompts on probable locations of class instances. CRPs are derived from elementary intra‑class and inter‑class feature distinctions without any additional training. Our method is versatile, accommodating any N‑way K‑shot configuration for the multi‑class few‑shot semantic segmentation (FSS) task. Unlike gradient‑learning adaptation of generalist models, which risks the loss of generalization and may suffer from catastrophic forgetting, SAC solely utilizes automated prompting and achieves superior results over state‑of‑the‑art methods on the COCO‑20i benchmark, particularly excelling in high N‑way class scenarios. SAC is an interesting demonstration of a prompt‑only approach to adapting foundation models for novel tasks with small, limited datasets without any modifications to the foundation model itself. This method offers interesting benefits such as intrinsic immunity to concept or feature loss and rapid, online task adaptation of foundation models.

Abstract:
We present FAST‑Splat for fast, ambiguity‑free semantic Gaussian Splatting, which seeks to address the main limitations of existing semantic Gaussian Splatting methods, namely: slow training and rendering speeds; high memory usage; and ambiguous semantic object localization. We take a bottom‑up approach in deriving FAST‑Splat, dismantling the limitations of closed‑set semantic distillation to enable open‑set (open‑vocabulary) semantic distillation. Ultimately, this key approach enables FAST‑Splat to provide precise semantic object localization results, even when prompted with ambiguous user‑provided natural‑language queries. Further, by exploiting the explicit form of the Gaussian Splatting scene representation to the fullest extent, FAST‑Splat retains the remarkable training and rendering speeds of Gaussian Splatting. Precisely, while existing semantic Gaussian Splatting methods distill semantics into a separate neural field or utilize neural models for dimensionality reduction, FAST‑Splat directly augments each Gaussian with specific semantic codes, preserving the training, rendering, and memory‑usage advantages of Gaussian Splatting over neural field methods. These Gaussian‑specific semantic codes, together with a hash‑table, enable semantic similarity to be measured with open‑vocabulary user prompts and further enable FAST‑Splat to respond with unambiguous semantic object labels and 3D masks, unlike prior methods. In experiments, we demonstrate that FAST‑Splat is 6x to 8x faster to train, achieves 18x to 51x faster rendering speeds, and requires about 6x less GPU memory, compared to the best‑competing semantic Gaussian Splatting methods. Further, FAST‑Splat achieves relatively similar or better semantic segmentation performance compared to existing methods. After the review period, we will provide links to the project website and the codebase.

Abstract:
Detecting disasters in underground mining, such as explosions and structural damage, has been a persistent challenge over the years. This problem is compounded for first responders, who often have no clear information about the extent or nature of the damage within the mine. The poor‑light or even total darkness inside the mines makes rescue efforts incredibly difficult, leading to a tragic loss of life. In this paper, we propose a novel instance segmentation method called DIS‑Mine, specifically designed to identify disaster‑affected areas within underground mines under low‑light or poor visibility conditions, aiding first responders in rescue efforts. DIS‑Mine is capable of detecting objects in images, even in complete darkness, by addressing challenges such as high noise, color distortions, and reduced contrast. The key innovations of DIS‑Mine are built upon four core components: i) Image brightness improvement, ii) Instance segmentation with SAM integration, iii) Mask R‑CNN‑based segmentation, and iv) Mask alignment with feature matching. On top of that, we have collected real‑world images from an experimental underground mine, introducing a new dataset named ImageMine, specifically gathered in low‑visibility conditions. This dataset serves to validate the performance of DIS‑Mine in realistic, challenging environments. Our comprehensive experiments on the ImageMine dataset, as well as on various other datasets demonstrate that DIS‑Mine achieves a superior F1 score of 86.0% and mIoU of 72.0%, outperforming state‑of‑the‑art instance segmentation methods, with at least 15x improvement and up to 80% higher precision in object detection.

Abstract:
Microscopy structure segmentation, such as detecting cells or nuclei, generally requires a human to draw a ground truth contour around each instance. Weakly supervised approaches (e.g. consisting of only single point labels) have the potential to reduce this workload significantly. Our approach uses individual point labels for an entropy estimation to approximate an underlying distribution of cell pixels. We infer full cell masks from this distribution, and use Mask‑RCNN to produce an instance segmentation output. We compare this point‑annotated approach with training on the full ground truth masks. We show that our method achieves a comparatively good level of performance, despite a 95% reduction in pixel labels.

Abstract:
Large‑scale 2D datasets have been instrumental in advancing machine learning; however, progress in 3D vision tasks has been relatively slow. This disparity is largely due to the limited availability of 3D benchmarking datasets. In particular, creating real‑world point cloud datasets for indoor scene semantic segmentation presents considerable challenges, including data collection within confined spaces and the costly, often inaccurate process of per‑point labeling to generate ground truths. While synthetic datasets address some of these challenges, they often fail to replicate real‑world conditions, particularly the occlusions that occur in point clouds collected from real environments. Existing 3D benchmarking datasets typically evaluate deep learning models under the assumption that training and test data are independently and identically distributed (IID), which affects the models' usability for real‑world point cloud segmentation. To address these challenges, we introduce the BelHouse3D dataset, a new synthetic point cloud dataset designed for 3D indoor scene semantic segmentation. This dataset is constructed using real‑world references from 32 houses in Belgium, ensuring that the synthetic data closely aligns with real‑world conditions. Additionally, we include a test set with data occlusion to simulate out‑of‑distribution (OOD) scenarios, reflecting the occlusions commonly encountered in real‑world point clouds. We evaluate popular point‑based semantic segmentation methods using our OOD setting and present a benchmark. We believe that BelHouse3D and its OOD setting will advance research in 3D point cloud semantic segmentation for indoor scenes, providing valuable insights for the development of more generalizable models.

Abstract:
This research presents an advanced AI‑powered ultrasound imaging system that incorporates real‑time image processing, organ tracking, and voice commands to enhance the efficiency and accuracy of diagnoses in clinical practice. Traditional ultrasound diagnostics often require significant time and introduce a degree of subjectivity due to user interaction. The goal of this innovative solution is to provide Sonologists with a more predictable and productive imaging procedure utilizing artificial intelligence, computer vision, and voice technology. The functionality of the system employs computer vision and deep learning algorithms, specifically adopting the Mask R‑CNN model from Detectron2 for semantic segmentation of organs and key landmarks. This automation improves diagnostic accuracy by enabling the extraction of valuable information with minimal human input. Additionally, it includes a voice recognition feature that allows for hands‑free operation, enabling users to control the system with commands such as freeze or liver, all while maintaining their focus on the patient. The architecture comprises video processing and real‑time segmentation modules that prepare the system to perform essential imaging functions, such as freezing and zooming in on frames. The liver histopathology module, optimized for detecting fibrosis, achieved an impressive accuracy of 98.6%. Furthermore, the organ segmentation module produces output confidence levels between 50% and 95%, demonstrating its efficacy in organ detection.

Abstract:
Accurate segmentation of Optical Coherence Tomography (OCT) images is crucial for diagnosing and monitoring retinal diseases. However, the labor‑intensive nature of pixel‑level annotation limits the scalability of supervised learning for large datasets. Weakly Supervised Semantic Segmentation (WSSS) offers a promising alternative by using weaker forms of supervision, such as image‑level labels, to reduce the annotation burden. Despite its advantages, weak supervision inherently carries limited information. We propose a novel WSSS framework with only image‑level labels for OCT lesion segmentation that integrates structural and text‑driven guidance to produce high‑quality, pixel‑level pseudo labels. The framework employs two visual processing modules: one that processes the original OCT images and another that operates on layer segmentations augmented with anomalous signals, enabling the model to associate lesions with their corresponding anatomical layers. Complementing these visual cues, we leverage large‑scale pretrained models to provide two forms of textual guidance: label‑derived descriptions that encode local semantics, and domain‑agnostic synthetic descriptions that, although expressed in natural image terms, capture spatial and relational semantics useful for generating globally consistent representations. By fusing these visual and textual features in a multi‑modal framework, our method aligns semantic meaning with structural relevance, thereby improving lesion localization and segmentation performance. Experiments on three OCT datasets demonstrate state‑of‑the‑art results, highlighting its potential to advance diagnostic accuracy and efficiency in medical imaging.

Abstract:
Transformers have revolutionized Computer Vision (CV) through self‑attention mechanisms. However, their complexity makes latent token representations difficult to interpret. We introduce ULTra, a framework for interpreting Transformer embeddings and uncovering meaningful semantic patterns within them. ULTra enables unsupervised semantic segmentation using pre‑trained models without requiring fine‑tuning. Additionally, we propose a self‑supervised training approach that refines segmentation performance by learning an external transformation matrix without modifying the underlying model. Our method achieves state‑of‑the‑art performance in unsupervised semantic segmentation, outperforming existing segmentation methods. Furthermore, we validate ULTra for model interpretation on both synthetic and real‑world scenarios, including Object Selection and interpretable text summarization using LLMs, demonstrating its broad applicability in explaining the semantic structure of latent token representations.

Abstract:
Event cameras operate fundamentally differently from traditional Active Pixel Sensor (APS) cameras, offering significant advantages. Recent research has developed simulators to convert video frames into events, addressing the shortage of real event datasets. Current simulators primarily focus on the logical behavior of event cameras. However, the fundamental analogue properties of pixel circuits are seldom considered in simulator design. The gap between analogue pixel circuit and discrete video frames causes the degeneration of synthetic events, particularly in high‑contrast scenes. In this paper, we propose a novel method of generating reliable event data based on a detailed analysis of the pixel circuitry in event cameras. We incorporate the analogue properties of event camera pixel circuits into the simulator design: (1) analogue filtering of signals from light intensity to events, and (2) a cutoff frequency that is independent of video frame rate. Experimental results on two relevant tasks, including semantic segmentation and image reconstruction, validate the reliability of simulated event data, even in high‑contrast scenes. This demonstrates that deep neural networks exhibit strong generalization from simulated to real event data, confirming that the synthetic events generated by the proposed method are both realistic and well‑suited for effective training.
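
The two analogue properties listed above can be illustrated with a toy single‑pixel model. This is a minimal sketch, not the paper's simulator: the first‑order low‑pass filter, the function name `simulate_pixel_events`, and the default cutoff and contrast threshold are all assumptions chosen for illustration. The point it shows is that the filter coefficient is derived from a cutoff frequency in Hz and the frame interval, so the filter bandwidth stays the same when the input video frame rate changes.

```python
import math


def simulate_pixel_events(intensities, frame_rate_hz, cutoff_hz=30.0, threshold=0.2):
    """Toy event generation for one pixel (illustrative assumptions only).

    intensities   : positive light-intensity samples, one per video frame
    frame_rate_hz : frame rate of the input video
    cutoff_hz     : assumed low-pass cutoff of the pixel circuit, in Hz
    threshold     : assumed contrast threshold on log intensity
    Returns a list of (timestamp_seconds, polarity) tuples.
    """
    dt = 1.0 / frame_rate_hz
    # Discretised first-order low-pass: the coefficient depends on cutoff_hz * dt,
    # so the cutoff is specified independently of the video frame rate.
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz * dt)

    filtered = math.log(intensities[0])  # filtered log intensity
    reference = filtered                 # level at which the last event fired
    events = []
    for i, value in enumerate(intensities[1:], start=1):
        filtered += alpha * (math.log(value) - filtered)
        # Emit one event per threshold crossing since the last reference level.
        while abs(filtered - reference) >= threshold:
            polarity = 1 if filtered > reference else -1
            reference += polarity * threshold
            events.append((i * dt, polarity))
    return events


# A brightness step: with a lower cutoff the first event fires later.
step = [1.0] * 5 + [math.e] * 50
fast = simulate_pixel_events(step, frame_rate_hz=1000.0, cutoff_hz=30.0)
slow = simulate_pixel_events(step, frame_rate_hz=1000.0, cutoff_hz=5.0)
```

In this toy model a sharp brightness step produces a delayed burst of events rather than an instantaneous one, which is the kind of behaviour a purely logical frame‑differencing simulator would miss.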

Abstract:
We present LiV‑GS, a LiDAR‑visual SLAM system in outdoor environments that leverages 3D Gaussians as a differentiable spatial representation. Notably, LiV‑GS is the first method that directly aligns discrete and sparse LiDAR data with continuous differentiable Gaussian maps in large‑scale outdoor scenes, overcoming the limitation of fixed resolution in traditional LiDAR mapping. The system aligns point clouds with Gaussian maps using shared covariance attributes for front‑end tracking and integrates the normal orientation into the loss function to refine the Gaussian map. To reliably and stably update Gaussians outside the LiDAR field of view, we introduce a novel conditional Gaussian constraint that aligns these Gaussians closely with the nearest reliable ones. The targeted adjustment enables LiV‑GS to achieve fast and accurate mapping with novel view synthesis at a rate of 7.98 FPS. Extensive comparative experiments demonstrate LiV‑GS's superior performance in SLAM, image rendering and mapping. The successful cross‑modal radar‑LiDAR localization highlights the potential of LiV‑GS for applications in cross‑modal semantic positioning and object segmentation with Gaussian maps.

Abstract:
Reliable deep learning models require not only accurate predictions but also well‑calibrated confidence estimates to ensure dependable uncertainty estimation. This is crucial in safety‑critical applications like autonomous driving, which depend on rapid and precise semantic segmentation of LiDAR point clouds for real‑time 3D scene understanding. In this work, we introduce a sampling‑free approach for estimating well‑calibrated confidence values for classification tasks, achieving alignment with true classification accuracy and significantly reducing inference time compared to sampling‑based methods. Our evaluation using the Adaptive Calibration Error (ACE) metric for LiDAR semantic segmentation shows that our approach maintains well‑calibrated confidence values while achieving increased processing speed compared to a sampling baseline. Additionally, reliability diagrams reveal that our method produces underconfident rather than overconfident predictions, an advantage for safety‑critical applications. Our sampling‑free approach offers well‑calibrated and time‑efficient predictions for LiDAR scene semantic segmentation.
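
For readers unfamiliar with the metric, the idea behind adaptive calibration error can be sketched as follows. This is a simplified illustration and an assumption about the evaluation setup, not the authors' code: it uses only the top‑1 confidence of each prediction and equal‑mass bins, whereas the full ACE definition also averages over classes. The function name and the toy inputs are hypothetical.

```python
def adaptive_calibration_error(confidences, correct, num_bins=5):
    """Simplified ACE over top-1 confidences (illustrative assumption).

    confidences : predicted confidence in [0, 1] for each sample
    correct     : True where the prediction matched the label
    Bins are equal-mass: samples are sorted by confidence and split into
    num_bins groups of (nearly) equal size, unlike fixed-width ECE bins.
    """
    pairs = sorted(zip(confidences, correct))
    n = len(pairs)
    if n == 0:
        raise ValueError("at least one sample is required")
    num_bins = min(num_bins, n)  # guarantees every bin is non-empty

    total_gap = 0.0
    for b in range(num_bins):
        chunk = pairs[b * n // num_bins:(b + 1) * n // num_bins]
        mean_conf = sum(c for c, _ in chunk) / len(chunk)
        accuracy = sum(1 for _, ok in chunk if ok) / len(chunk)
        total_gap += abs(accuracy - mean_conf)
    return total_gap / num_bins


# Low-confidence bin is underconfident (0.6 vs. accuracy 1.0),
# high-confidence bin is overconfident (0.9 vs. accuracy 0.5).
ace = adaptive_calibration_error([0.6, 0.6, 0.9, 0.9],
                                 [True, True, True, False],
                                 num_bins=2)
```

A value of zero means the mean confidence equals the accuracy in every bin. The absolute gap alone does not say whether a model is under‑ or overconfident, which is why reliability diagrams are typically inspected alongside it.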

Abstract:
Shape assembly is a ubiquitous task in daily life, integral for constructing complex 3D structures like IKEA furniture. While significant progress has been made in developing autonomous agents for shape assembly, existing datasets have not yet tackled the 4D grounding of assembly instructions in videos, essential for a holistic understanding of assembly in 3D space over time. We introduce IKEA Video Manuals, a dataset that features 3D models of furniture parts, instructional manuals, assembly videos from the Internet, and most importantly, annotations of dense spatio‑temporal alignments between these data modalities. To demonstrate the utility of IKEA Video Manuals, we present five applications essential for shape assembly: assembly plan generation, part‑conditioned segmentation, part‑conditioned pose estimation, video object segmentation, and furniture assembly based on instructional video manuals. For each application, we provide evaluation metrics and baseline methods. Through experiments on our annotated data, we highlight many challenges in grounding assembly instructions in videos to improve shape assembly, including handling occlusions, varying viewpoints, and extended assembly sequences.

Abstract:
There is growing interest in applying AI to radiology report generation, particularly for chest X‑rays (CXRs). This paper investigates whether incorporating pixel‑level information through segmentation masks can improve fine‑grained image interpretation of multimodal large language models (MLLMs) for radiology report generation. We introduce MAIRA‑Seg, a segmentation‑aware MLLM framework designed to utilize semantic segmentation masks alongside CXRs for generating radiology reports. We train expert segmentation models to obtain mask pseudolabels for radiology‑specific structures in CXRs. Subsequently, building on the architectures of MAIRA, a CXR‑specialised model for report generation, we integrate a trainable segmentation tokens extractor that leverages these mask pseudolabels, and employ mask‑aware prompting to generate draft radiology reports. Our experiments on the publicly available MIMIC‑CXR dataset show that MAIRA‑Seg outperforms non‑segmentation baselines. We also investigate set‑of‑marks prompting with MAIRA and find that MAIRA‑Seg consistently demonstrates comparable or superior performance. The results confirm that using segmentation masks enhances the nuanced reasoning of MLLMs, potentially contributing to better clinical outcomes.

Abstract:
Underwater surveys provide long‑term data for informing management strategies, monitoring coral reef health, and estimating blue carbon stocks. Advances in broad‑scale survey methods, such as robotic underwater vehicles, have increased the range of marine surveys but generate large volumes of imagery requiring analysis. Computer vision methods such as semantic segmentation aid automated image analysis, but typically rely on fully supervised training with extensive labelled data. While ground truth label masks for tasks like street scene segmentation can be quickly and affordably generated by non‑experts through crowdsourcing services like Amazon Mechanical Turk, ecology presents greater challenges. The complexity of underwater images, coupled with the specialist expertise needed to accurately identify species at the pixel level, makes this process costly, time‑consuming, and heavily dependent on domain experts. In recent years, some works have performed automated analysis of underwater imagery, and a smaller number of studies have focused on weakly supervised approaches which aim to reduce the expert‑provided labelled data required. This survey focuses on approaches which reduce dependency on human expert input, while reviewing the prior and related approaches to position these works in the wider field of underwater perception. Further, we offer an overview of coastal ecosystems and the challenges of underwater imagery. We provide background on weakly and self‑supervised deep learning and integrate these elements into a taxonomy that centres on the intersection of underwater monitoring, computer vision, and deep learning, while motivating approaches for weakly supervised deep learning with reduced dependency on domain expert data annotations. Lastly, the survey examines available datasets and platforms, and identifies gaps, barriers, and opportunities for automating underwater surveys.

Abstract:
Currently, deep learning‑based instance segmentation for various applications (e.g., agriculture) is predominantly performed using a labor‑intensive process involving extensive field data collection using sophisticated sensors, followed by careful manual annotation of images, presenting significant logistical and financial challenges to researchers and organizations. The process also slows down model development and training. In this study, we present a novel method for deep learning‑based instance segmentation of apples in commercial orchards that eliminates the need for labor‑intensive field data collection and manual annotation. Utilizing a Large Language Model (LLM), we synthetically generated orchard images and automatically annotated them using the Segment Anything Model (SAM) integrated with a YOLO11 base model. This method significantly reduces reliance on physical sensors and manual data processing, presenting a major advancement in "Agricultural AI". The synthetic, auto‑annotated dataset was used to train the YOLO11 model for apple instance segmentation, which was then validated on real orchard images. The results showed that the automatically generated annotations achieved a Dice Coefficient of 0.9513 and an IoU of 0.9303, validating the accuracy and overlap of the mask annotations. All YOLO11 configurations, trained solely on these synthetic datasets with automated annotations, accurately recognized and delineated apples, highlighting the method's efficacy. Specifically, the YOLO11m‑seg configuration achieved a mask precision of 0.902 and a mask mAP@50 of 0.833 on test images collected from a commercial orchard. Additionally, the YOLO11l‑seg configuration outperformed other models in validation on 40 LLM‑generated images, achieving the highest mask precision and mAP@50 metrics. Keywords: YOLO, SAM, SAMv2, YOLO11, YOLOv11, Segment Anything, YOLO‑SAM
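
The Dice coefficient and IoU quoted above are standard mask‑overlap measures. A minimal sketch of how they are computed for one pair of binary masks is given below; the function name, the nested‑list mask representation, and the toy masks are assumptions for illustration, and the abstract does not state how per‑image scores were aggregated.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two equally sized binary masks.

    Masks are nested lists of 0/1 values (rows of pixels).
    """
    inter = pred_sum = target_sum = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            pred_sum += 1 if p else 0
            target_sum += 1 if t else 0

    union = pred_sum + target_sum - inter
    if union == 0:
        # Both masks empty: treated here as perfect agreement (a convention).
        return 1.0, 1.0
    dice = 2 * inter / (pred_sum + target_sum)
    iou = inter / union
    return dice, iou


pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
dice, iou = dice_and_iou(pred, target)  # 2 shared pixels, 3 in each mask
```

For a single pair of masks the two scores are linked by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU; that identity does not carry over to scores averaged across many images.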

Abstract:
Lane detection involves identifying lanes on the road and accurately determining their location and shape. This is a crucial technique for modern assisted and autonomous driving systems. However, several unique properties of lanes pose challenges for detection methods. The lack of distinctive features can cause lane detection algorithms to be confused by other objects with similar appearances. Additionally, the varying number of lanes and the diversity in lane line patterns, such as solid, broken, single, double, merging, and splitting lines, further complicate the task. To address these challenges, Deep Learning (DL) approaches can be employed in various ways. Merging DL models with an attention mechanism has recently surfaced as a new approach. In this context, two deep learning‑based lane recognition methods are proposed in this study. The first method employs the Feature Pyramid Network (FPN) model, delivering an impressive 87.59% accuracy in detecting road lanes. The second method, which incorporates attention layers into the U‑Net model, significantly boosts the performance of semantic segmentation tasks. The advanced model, achieving an extraordinary 98.98% accuracy and far surpassing the basic U‑Net model, clearly showcases its superiority over existing methods in a comparative analysis. The groundbreaking findings of this research pave the way for the development of more effective and reliable road lane detection methods, significantly advancing the capabilities of modern assisted and autonomous driving systems.

Abstract:
This paper presents a novel method for discovering systematic errors in segmentation models. For instance, a systematic error in a segmentation model can be a sufficiently large number of cases in which the model misclassifies the target class of pedestrians as parking meters. With the rapid deployment of these models in critical applications such as autonomous driving, it is vital to detect and interpret these systematic errors. However, the key challenge is automatically discovering such failures on unlabelled data and forming interpretable semantic sub‑groups for intervention. For this, we leverage multimodal foundation models to retrieve errors and use conceptual linkage along with erroneous nature to study the systematic nature of these errors. We demonstrate that such errors are present in SOTA segmentation models (UperNet ConvNeXt and UperNet Swin) trained on the Berkeley Deep Drive dataset, and we benchmark the approach qualitatively and quantitatively, showing its effectiveness by discovering coherent systematic errors for these models. Our work opens up the avenue to model analysis and intervention that have so far been underexplored in semantic segmentation.

Abstract:
This study introduces SpineSegDiff, a diffusion‑based framework for robust and accurate segmentation of vertebrae, intervertebral discs (IVDs), and the spinal canal from Magnetic Resonance Imaging (MRI) scans of patients with low back pain (LBP), regardless of whether the scans are T1‑ or T2‑weighted. The results showed that SpineSegDiff performed comparably to or better than non‑diffusion state‑of‑the‑art models in the identification of degenerated IVDs. Our findings highlight the potential of diffusion models to improve LBP diagnosis and management through precise spine MRI analysis.

Abstract:
We present Y‑MAP‑Net, a Y‑shaped neural network architecture designed for real‑time multi‑task learning on RGB images. Y‑MAP‑Net simultaneously predicts depth, surface normals, human pose, and semantic segmentation, and generates multi‑label captions, all from a single network evaluation. To achieve this, we adopt a multi‑teacher, single‑student training paradigm, where task‑specific foundation models supervise the network's learning, enabling it to distill their capabilities into a lightweight architecture suitable for real‑time applications. Y‑MAP‑Net exhibits strong generalization, simplicity and computational efficiency, making it ideal for robotics and other practical scenarios. To support future research, we will release our code publicly.

Abstract:
In this paper, we introduce Motion‑Grounded Video Reasoning, a new motion understanding task that requires generating visual answers (video segmentation masks) according to the input question, and hence needs implicit spatiotemporal reasoning and grounding. This task extends existing spatiotemporal grounding work focusing on explicit action/motion grounding, to a more general format by enabling implicit reasoning via questions. To facilitate the development of the new task, we collect a large‑scale dataset called GROUNDMORE, which comprises 1,715 video clips and 249K object masks, deliberately designed with 4 question types (Causal, Sequential, Counterfactual, and Descriptive) for benchmarking deep and comprehensive motion reasoning abilities. GROUNDMORE uniquely requires models to generate visual answers, providing a more concrete and visually interpretable response than plain text. It evaluates models on both spatiotemporal grounding and reasoning, fostering progress on complex challenges in motion‑related video reasoning, temporal perception, and pixel‑level understanding. Furthermore, we introduce a novel baseline model named Motion‑Grounded Video Reasoning Assistant (MORA). MORA incorporates the multimodal reasoning ability from the Multimodal LLM, the pixel‑level perception capability from the grounding model (SAM), and the temporal perception ability from a lightweight localization head. MORA achieves respectable performance on GROUNDMORE, outperforming the best existing visual grounding baseline model by an average relative margin of 21.5%. We hope this novel and challenging task will pave the way for future advancements in robust and general motion understanding via video reasoning segmentation.

Abstract:
The primary value of infrared and visible image fusion technology lies in applying the fusion results to downstream tasks. However, existing methods face challenges such as increased training complexity and significantly compromised performance of individual tasks when addressing multiple downstream tasks simultaneously. To tackle this, we propose Task‑Oriented Adaptive Regulation (T‑OAR), an adaptive mechanism specifically designed for multi‑task environments. Additionally, we introduce the Task‑related Dynamic Prompt Injection (T‑DPI) module, which generates task‑specific dynamic prompts from user‑input text instructions and integrates them into target representations. This guides the feature extraction module to produce representations that are more closely aligned with the specific requirements of downstream tasks. By incorporating the T‑DPI module into the T‑OAR framework, our approach generates fusion images tailored to task‑specific requirements without the need for separate training or task‑specific weights. This not only reduces computational costs but also enhances adaptability and performance across multiple tasks. Experimental results show that our method excels in object detection, semantic segmentation, and salient object detection, demonstrating its strong adaptability, flexibility, and task specificity. This provides an efficient solution for image fusion in multi‑task environments, highlighting the technology's potential across diverse applications.

Abstract:
Improving hyperspectral image (HSI) semantic segmentation by exploiting complementary information from a supplementary data type (referred to as X‑modality) is promising but challenging due to differences in imaging sensors, image content, and resolution. Current techniques struggle to enhance modality‑specific and modality‑shared information, as well as to capture dynamic interaction and fusion between different modalities. In response, this study proposes CoMiX, an asymmetric encoder‑decoder architecture with deformable convolutions (DCNs) for HSI‑X semantic segmentation. CoMiX is designed to extract, calibrate, and fuse information from HSI and X data. Its pipeline includes an encoder with two parallel and interacting backbones and a lightweight all‑multilayer perceptron (ALL‑MLP) decoder. The encoder consists of four stages, each incorporating 2D DCN blocks for the X model to accommodate geometric variations and 3D DCN blocks for HSIs to adaptively aggregate spatial‑spectral features. Additionally, each stage includes a Cross‑Modality Feature enhancement and eXchange (CMFeX) module and a feature fusion module (FFM). CMFeX is designed to exploit spatial‑spectral correlations from different modalities to recalibrate and enhance modality‑specific and modality‑shared features while adaptively exchanging complementary information between them. Outputs from CMFeX are fed into the FFM for fusion and passed to the next stage for further information learning. Finally, the outputs from each FFM are integrated by the ALL‑MLP decoder for final prediction. Extensive experiments demonstrate that our CoMiX achieves superior performance and generalizes well to various multimodal recognition tasks. The CoMiX code will be released.

Abstract:
The Segment Anything Model (SAM) and similar models build a family of promptable foundation models (FMs) for image and video segmentation. The object of interest is identified using prompts, such as bounding boxes or points. With these FMs becoming part of medical image segmentation, extensive evaluation studies are required to assess their strengths and weaknesses in clinical settings. Since the performance is highly dependent on the chosen prompting strategy, it is important to investigate different prompting techniques to define optimal guidelines that ensure effective use in medical image segmentation. Currently, no dedicated evaluation studies exist specifically for bone segmentation in CT scans, leaving a gap in understanding the performance for this task. Thus, we use non‑iterative, "optimal" prompting strategies composed of bounding box, points and combinations to test the zero‑shot capability of SAM‑family models for bone CT segmentation on three different skeletal regions. Our results show that the best settings depend on the model type and size, dataset characteristics and objective to optimize. Overall, SAM and SAM2 prompted with a bounding box in combination with the center point for all the components of an object yield the best results across all tested settings. As the results depend on multiple factors, we provide a guideline for informed decision‑making in 2D prompting with non‑interactive, "optimal" prompts.
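
The best‑performing prompt described above (one bounding box plus a center point for each component of the object) can be derived from a ground‑truth mask. The sketch below is one plausible reading under stated assumptions, not the authors' implementation: components are found with 4‑connectivity, and each "center point" is taken as the component pixel nearest to the component's centroid so that the point always lies on the object. Both function names are hypothetical.

```python
from collections import deque


def connected_components(mask):
    """Return the 4-connected foreground components as lists of (y, x) pixels."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    comps = []
    for y in range(height):
        for x in range(width):
            if not mask[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue, pixels = deque([(y, x)]), []
            while queue:  # breadth-first flood fill
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            comps.append(pixels)
    return comps


def box_and_center_prompts(mask):
    """Return ((x_min, y_min, x_max, y_max), [(x, y), ...]) for a binary mask."""
    comps = connected_components(mask)
    if not comps:
        return None, []
    ys = [y for comp in comps for y, _ in comp]
    xs = [x for comp in comps for _, x in comp]
    box = (min(xs), min(ys), max(xs), max(ys))  # one box around the whole object

    points = []
    for comp in comps:
        mean_y = sum(y for y, _ in comp) / len(comp)
        mean_x = sum(x for _, x in comp) / len(comp)
        # Snap the centroid to the nearest pixel that belongs to the component.
        py, px = min(comp, key=lambda p: (p[0] - mean_y) ** 2 + (p[1] - mean_x) ** 2)
        points.append((px, py))
    return box, points


mask = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1],
]
box, points = box_and_center_prompts(mask)  # one box, one point per component
```

The resulting box and points use (x, y) pixel coordinates, the convention commonly expected by promptable segmentation models; passing them to a specific SAM implementation is left out because the exact call depends on the library used.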

Abstract:
Morphological methods play a crucial role in remote sensing image processing, due to their ability to capture and preserve small structural details. However, most of the existing deep learning models for semantic segmentation are based on the encoder‑decoder architecture, including U‑Net and the Segment Anything Model (SAM), where the downsampling process tends to discard fine details. In this paper, we propose a new approach that integrates a learnable morphological skeleton prior into deep neural networks using the variational method. To address the difficulty in backpropagation caused by the non‑differentiability of classical morphological operations, we provide a smooth representation of the morphological skeleton and design a variational segmentation model that integrates the morphological skeleton prior by employing operator splitting and dual methods. Then, we integrate this model into the network architecture of SAM by adding a token to the mask decoder and modifying the final sigmoid layer, ensuring the final segmentation results preserve the skeleton structure as much as possible. Experimental results on remote sensing datasets, including buildings, roads and water, demonstrate that our method outperforms the original SAM on slender object segmentation and exhibits better generalization capability.

Abstract:
This paper introduces a novel framework for unified incremental few‑shot object detection (iFSOD) and instance segmentation (iFSIS) using the Transformer architecture. Our goal is to create an optimal solution for situations where only a few examples of novel object classes are available, with no access to training data for base or old classes, while maintaining high performance across both base and novel classes. To achieve this, we extend Mask‑DINO into a two‑stage incremental learning framework. Stage 1 focuses on optimizing the model using the base dataset, while Stage 2 involves fine‑tuning the model on novel classes. In addition, we incorporate a classifier selection strategy that assigns appropriate classifiers to the encoder and decoder according to their distinct functions. Empirical evidence indicates that this approach effectively mitigates over‑fitting when learning novel classes. Furthermore, we implement knowledge distillation to prevent catastrophic forgetting of base classes. Comprehensive evaluations on the COCO and LVIS datasets for both iFSIS and iFSOD tasks demonstrate that our method significantly outperforms state‑of‑the‑art approaches.

Abstract:
In this paper we present three different applications, based on deep learning methodologies, that we are developing to support the scientific analysis conducted within the ASKAP‑EMU and MeerKAT radio surveys. One employs instance segmentation frameworks to detect compact and extended radio sources and imaging artefacts from radio continuum images. Another application uses gradient boosting decision trees and convolutional neural networks to classify compact sources into different astronomical classes using combined radio and infrared multi‑band images. Finally, we discuss how self‑supervised learning can be used to obtain valuable radio data representations for source detection, and classification studies.

Abstract:
Mueller matrix polarimetry captures essential information about polarized light interactions with a sample, presenting unique challenges for data augmentation in deep learning due to its distinct structure. While augmentations are an effective and affordable way to enhance dataset diversity and reduce overfitting, standard transformations like rotations and flips do not preserve the polarization properties in Mueller matrix images. To this end, we introduce a versatile simulation framework that applies physically consistent rotations and flips to Mueller matrices, tailored to maintain polarization fidelity. Our experimental results across multiple datasets reveal that conventional augmentations can lead to falsified results when applied to polarimetric data, underscoring the necessity of our physics‑based approach. In our experiments, we first compare our polarization‑specific augmentations against real‑world captures to validate their physical consistency. We then apply these augmentations in a semantic segmentation task, achieving substantial improvements in model generalization and performance. This study underscores the necessity of physics‑informed data augmentation for polarimetric imaging in deep learning (DL), paving the way for broader adoption and more robust applications across diverse research in the field. In particular, our framework unlocks the potential of DL models for polarimetric datasets with limited sample sizes. Our code implementation is available at github.com/hahnec/polar_augment.
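
The idea of a polarization-preserving rotation can be illustrated with the textbook transformation for a rotated optical element, M' = R(-theta) M R(theta), where R is the 4x4 Mueller rotation matrix. The sketch below is a minimal, standard-library illustration of that per-pixel step only; it is not taken from the polar_augment code, the function names are hypothetical, and a full image augmentation would additionally have to move the pixels spatially. The sign convention for theta is an assumption and varies between references.

```python
import math

def rotator(theta):
    """4x4 Mueller rotation matrix for an angle theta in radians (assumed convention)."""
    c, s = math.cos(2 * theta), math.sin(2 * theta)
    return [
        [1.0, 0.0, 0.0, 0.0],
        [0.0, c, s, 0.0],
        [0.0, -s, c, 0.0],
        [0.0, 0.0, 0.0, 1.0],
    ]

def matmul(a, b):
    """Plain 4x4 matrix product on nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def rotate_mueller(m, theta):
    """Mueller matrix of the same element after rotating it by theta."""
    return matmul(rotator(-theta), matmul(m, rotator(theta)))

# Ideal horizontal linear polarizer (standard textbook matrix).
HORIZONTAL_POLARIZER = [
    [0.5, 0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]

# Rotating the element by 90 degrees should give a vertical polarizer.
rotated = rotate_mueller(HORIZONTAL_POLARIZER, math.pi / 2)
```

Simply rotating the image grid would leave each pixel's matrix as the horizontal polarizer, whereas the physically rotated sample behaves as a vertical one; this mismatch is the kind of inconsistency the abstract attributes to conventional augmentations.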

Abstract:
Accurate and consistent fruit monitoring over time is a key step toward automated agricultural production systems. However, this task is inherently difficult due to variations in fruit size, shape, occlusion, orientation, and the dynamic nature of orchards where fruits may appear or disappear between observations. In this article, we propose a novel method for fruit instance segmentation and re‑identification on 3D terrestrial point clouds collected over time. Our approach directly operates on dense colored point clouds, capturing fine‑grained 3D spatial detail. We segment individual fruits using a learning‑based instance segmentation method applied directly to the point cloud. For each segmented fruit, we extract a compact and discriminative descriptor using a 3D sparse convolutional neural network. To track fruits across different times, we introduce an attention‑based matching network that associates fruits with their counterparts from previous sessions. Matching is performed using a probabilistic assignment scheme, selecting the most likely associations across time. We evaluate our approach on real‑world datasets of strawberries and apples, demonstrating that it outperforms existing methods in both instance segmentation and temporal re‑identification, enabling robust and precise fruit monitoring across complex and dynamic orchard environments. Keywords: Agricultural Robotics, 3D Fruit Tracking, Instance Segmentation, Deep Learning, Point Clouds, Sparse Convolutional Networks, Temporal Monitoring

Abstract:
Presently, deep learning and convolutional neural networks (CNNs) are widely used in image processing, image classification, object identification, and many other fields. In this work, we implemented a modified U‑Net model and a VGG‑UNet model, both based on convolutional neural networks, to automatically identify objects in satellite imagery captured by high‑resolution Indian remote sensing satellites and then to classify the satellite data pixel‑wise into various classes. Cartosat 2S (~1 m spatial resolution) datasets were used, and the deep learning models detected building shapes and ships in the test datasets with an accuracy of more than 95%. In another experiment, microwave data (varied resolution) from RISAT‑1 was taken as input, and ships and trees were detected with an accuracy of >96%. For the classification of images into multiple classes, a deep learning model was trained on multispectral Cartosat images. Model‑generated results were then tested against ground truth. Multi‑label classification results were obtained with an accuracy (IoU) better than 95%. In total, six different problems were attempted using deep learning models, and IoU accuracies in the range of 85% to 98% were achieved depending on the degree of complexity.

Abstract:
We introduce GaussianCut, a new method for interactive multiview segmentation of scenes represented as 3D Gaussians. Our approach allows for selecting the objects to be segmented by interacting with a single view. It accepts intuitive user input, such as point clicks, coarse scribbles, or text. Using 3D Gaussian Splatting (3DGS) as the underlying scene representation simplifies the extraction of objects of interest which are considered to be a subset of the scene's Gaussians. Our key idea is to represent the scene as a graph and use the graph‑cut algorithm to minimize an energy function to effectively partition the Gaussians into foreground and background. To achieve this, we construct a graph based on scene Gaussians and devise a segmentation‑aligned energy function on the graph to combine user inputs with scene properties. To obtain an initial coarse segmentation, we leverage 2D image/video segmentation models and further refine these coarse estimates using our graph construction. Our empirical evaluations show the adaptability of GaussianCut across a diverse set of scenes. GaussianCut achieves competitive performance with state‑of‑the‑art approaches for 3D segmentation without requiring any additional segmentation‑aware training.

Abstract:
High‑speed video (HSV) phase detection (PD) segmentation is crucial for monitoring vapor, liquid, and microlayer phases in industrial processes. While CNN‑based models like U‑Net have shown success in simplified shadowgraphy‑based two‑phase flow (TPF) analysis, their application to complex HSV PD tasks remains unexplored, and vision foundation models (VFMs) have yet to address the complexities of either shadowgraphy‑based or PD TPF video segmentation. Existing uncertainty quantification (UQ) methods lack pixel‑level reliability for critical metrics like contact line density and dry area fraction, and the absence of large‑scale, multimodal experimental datasets tailored to PD segmentation further impedes progress. To address these gaps, we propose MSEG‑VCUQ. This hybrid framework integrates U‑Net CNNs with the transformer‑based Segment Anything Model (SAM) to achieve enhanced segmentation accuracy and cross‑modality generalization. Our approach incorporates systematic UQ for robust error assessment and introduces the first open‑source multimodal HSV PD datasets. Empirical results demonstrate that MSEG‑VCUQ outperforms baseline CNNs and VFMs, enabling scalable and reliable PD segmentation for real‑world boiling dynamics.

Abstract:
This study introduces a novel data‑centric approach to improve real‑time surgical guidance using fiber‑based fluorescence lifetime imaging (FLIm). A key aspect of the methodology is the accurate detection of the aiming beam, which is essential for localizing points used to map FLIm measurements onto the tissue region within the surgical field. The primary challenge arises from the complex and variable conditions encountered in the surgical environment, particularly in Transoral Robotic Surgery (TORS). Uneven illumination in the surgical field can cause reflections, reduce contrast, and result in inconsistent color representation, further complicating aiming beam detection. To overcome these challenges, an instance segmentation model was developed using a data‑centric training strategy that improves accuracy by minimizing label noise and enhancing detection robustness. The model was evaluated on a dataset comprising 40 in vivo surgical videos, demonstrating a median detection rate of 85%. This performance was maintained when the model was integrated into a clinical system, achieving a similar detection rate of 85% during TORS procedures conducted in patients. The system's computational efficiency, measured at approximately 24 frames per second (FPS), was sufficient for real‑time surgical guidance. This study enhances the reliability of FLIm‑based aiming beam detection in complex surgical environments, advancing the feasibility of real‑time, image‑guided interventions for improved surgical precision.

Abstract:
The ambiguity at the boundaries of different semantic classes in point cloud semantic segmentation often leads to incorrect decisions in intelligent perception systems, such as autonomous driving. Hence, accurate delineation of the boundaries is crucial for improving safety in autonomous driving. A novel spatial inter‑correlation enhancement and spatially‑embedded feature fusion network (SIESEF‑FusionNet) is proposed in this paper, enhancing spatial inter‑correlation by combining inverse distance weighting and angular compensation to extract more beneficial spatial information without causing redundancy. Meanwhile, a new spatial adaptive pooling module is also designed, embedding enhanced spatial information into semantic features to strengthen their context‑awareness. Experimental results demonstrate that SIESEF‑FusionNet achieves 83.7% mIoU and 97.8% OA on the Toronto3D dataset, outperforming other baseline methods. It reaches 61.1% mIoU on the SemanticKITTI dataset, where a marked improvement in segmentation performance is observed. In addition, the effectiveness and plug‑and‑play capability of the proposed modules are further verified through ablation studies.
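
Because mIoU and OA are the headline metrics in this and several neighboring abstracts, a small sketch of how both are commonly derived from a confusion matrix may help. It uses only the standard library; the function names and the six-point toy labels are hypothetical and unrelated to Toronto3D or SemanticKITTI, and benchmark toolkits may differ in details such as ignored classes.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """cm[t][p] counts points with ground-truth class t predicted as class p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

def overall_accuracy(cm):
    """OA: fraction of all points whose predicted class is correct."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total

def mean_iou(cm):
    """mIoU: average of TP / (TP + FP + FN) over classes that occur at all."""
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp
        fp = sum(cm[r][c] for r in range(n)) - tp
        denom = tp + fp + fn
        if denom > 0:  # skip classes absent from both labels and predictions
            ious.append(tp / denom)
    return sum(ious) / len(ious)

# Hypothetical toy labels for three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
oa = overall_accuracy(cm)   # 4 of 6 points correct
miou = mean_iou(cm)         # per-class IoUs are 1/3, 2/3 and 1/2
```

OA is dominated by frequent classes, while mIoU weights every class equally, which is why the two numbers can differ widely on imbalanced point clouds.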

Abstract:
Kolmogorov‑Arnold Networks (KANs), as a theoretically efficient neural network architecture, have garnered attention for their potential in capturing complex patterns. However, their application in computer vision remains relatively unexplored. This study first analyzes the potential of KANs in computer vision tasks, evaluating the performance of KAN and its convolutional variants in image classification and semantic segmentation. The focus is placed on examining their characteristics across varying data scales and noise levels. Results indicate that while KAN exhibits stronger fitting capabilities, it is highly sensitive to noise, limiting its robustness. To address this challenge, we propose a smoothness regularization method and introduce a Segment Deactivation technique. Both approaches enhance KAN's stability and generalization, demonstrating its potential in handling complex visual data tasks.

Abstract:
Off‑road environments pose significant perception challenges for high‑speed autonomous navigation due to unstructured terrain, degraded sensing conditions, and domain‑shifts among biomes. Learning semantic information across these conditions and biomes can be challenging when a large amount of ground truth data is required. In this work, we propose an approach that leverages a pre‑trained Vision Transformer (ViT) with fine‑tuning on a small (<500 images), sparse and coarsely labeled (<30% pixels) multi‑biome dataset to predict 2D semantic segmentation classes. These classes are fused over time via a novel range‑based metric and aggregated into a 3D semantic voxel map. We demonstrate zero‑shot out‑of‑biome 2D semantic segmentation on the Yamaha (52.9 mIoU) and Rellis (55.5 mIoU) datasets along with few‑shot coarse sparse labeling with existing data for improved segmentation performance on Yamaha (66.6 mIoU) and Rellis (67.2 mIoU). We further illustrate the feasibility of using a voxel map with a range‑based semantic fusion approach to handle common off‑road hazards like pop‑up hazards, overhangs, and water features.

Abstract:
For many years, image over‑segmentation into superpixels has been essential to computer vision pipelines, creating homogeneous and identifiable regions of similar sizes. Such a constrained segmentation problem would require a clear definition and specific evaluation criteria. However, the validation framework for superpixel methods, typically viewed as standard object segmentation, has rarely been thoroughly studied. In this work, we first take a step back to show that superpixel segmentation is fundamentally an ill‑posed problem, due to the implicit regularity constraint on the shape and size of superpixels. We also demonstrate through a novel comprehensive study that the literature suffers from only evaluating certain aspects, sometimes incorrectly and with inappropriate metrics. Concurrently, recent deep learning‑based superpixel methods mainly focus on the object segmentation task at the expense of regularity. In this ill‑posed context, we show that we can achieve competitive results using a recent architecture like the Segment Anything Model (SAM), without dedicated training for the superpixel segmentation task. This leads to rethinking superpixel segmentation and the necessary properties depending on the targeted downstream task.

Abstract:
Recent self‑supervised learning (SSL) methods have demonstrated impressive results in learning visual representations from unlabeled remote sensing images. However, most remote sensing images predominantly consist of scenographic scenes containing multiple ground objects without explicit foreground targets, which limits the performance of existing SSL methods that focus on foreground targets. This raises the question: Is there a method that can automatically aggregate similar objects within scenographic remote sensing images, thereby enabling models to differentiate knowledge embedded in various geospatial patterns for improved feature representation? In this work, we present the Pattern Integration and Enhancement Vision Transformer (PIEViT), a novel self‑supervised learning framework designed specifically for remote sensing imagery. PIEViT utilizes a teacher‑student architecture to address both image‑level and patch‑level tasks. It employs the Geospatial Pattern Cohesion (GPC) module to explore the natural clustering of patches, enhancing the differentiation of individual features. The Feature Integration Projection (FIP) module further refines masked token reconstruction using geospatially clustered patches. We validated PIEViT across multiple downstream tasks, including object detection, semantic segmentation, and change detection. Experiments demonstrated that PIEViT enhances the representation of internal patch features, providing significant improvements over existing self‑supervised baselines. It achieves excellent results in object detection, land cover classification, and change detection, underscoring its robustness, generalization, and transferability for remote sensing image interpretation tasks.

Abstract:
Rapid ice recession in the Arctic Ocean, with predictions of ice‑free summers by 2060, opens new maritime routes but requires reliable navigation solutions. Current approaches rely heavily on subjective expert judgment, underscoring the need for automated, data‑driven solutions. This study leverages machine learning to assess ice conditions using ship‑borne optical data, introducing a finely annotated dataset of 946 images and a semi‑manual, region‑based annotation technique. The proposed video segmentation model, UPerFlow, advances the SegFlow architecture by incorporating a six‑channel ResNet encoder, two UPerNet‑based segmentation decoders for each image, PWCNet as the optical flow encoder, and cross‑connections that integrate bi‑directional flow features without loss of latent information. The proposed architecture outperforms baseline image segmentation networks by an average of 38% in occluded regions, demonstrating the robustness of video segmentation in addressing challenging Arctic conditions.

Abstract:
Fine‑grained alignment between videos and text is challenging due to complex spatial and temporal dynamics in videos. Existing video‑based Large Multimodal Models (LMMs) handle basic conversations but struggle with precise pixel‑level grounding in videos. To address this, we introduce VideoGLaMM, an LMM designed for fine‑grained pixel‑level grounding in videos based on user‑provided textual inputs. Our design seamlessly connects three key components: a Large Language Model, a dual vision encoder that emphasizes both spatial and temporal details, and a spatio‑temporal decoder for accurate mask generation. This connection is facilitated via tunable V‑L and L‑V adapters that enable close Vision‑Language (VL) alignment. The architecture is trained to synchronize both spatial and temporal elements of video content with textual instructions. To enable fine‑grained grounding, we curate a multimodal dataset featuring detailed visually‑grounded conversations using a semiautomatic annotation pipeline, resulting in a diverse set of 38k video‑QA triplets along with 83k objects and 671k masks. We evaluate VideoGLaMM on three challenging tasks: Grounded Conversation Generation, Visual Grounding, and Referring Video Segmentation. Experimental results show that our model consistently outperforms existing approaches across all three tasks.

Abstract:
Large‑scale foundation models like CLIP have shown strong zero‑shot generalization but struggle with domain shifts, limiting their adaptability. In our work, we introduce StyLIP, a novel domain‑agnostic prompt learning strategy for Domain Generalization (DG). StyLIP disentangles visual style and content in CLIP's vision encoder by using style projectors to learn domain‑specific prompt tokens and combining them with content features. Trained contrastively, this approach enables seamless adaptation across domains, outperforming state‑of‑the‑art methods on multiple DG benchmarks. Additionally, we propose AD‑CLIP for unsupervised domain adaptation (DA), leveraging CLIP's frozen vision backbone to learn domain‑invariant prompts through image style and content features. By aligning domains in embedding space with entropy minimization, AD‑CLIP effectively handles domain shifts, even when only target domain samples are available. Lastly, we outline future work on class discovery using prompt learning for semantic segmentation in remote sensing, focusing on identifying novel or rare classes in unstructured environments. This paves the way for more adaptive and generalizable models in complex, real‑world scenarios.

Abstract:
Facade semantic segmentation is a long‑standing challenge in photogrammetry and computer vision. Although the last decades have witnessed an influx of facade segmentation methods, there is a lack of comprehensive facade classes and data covering the architectural variability. In ZAHA, we introduce Level of Facade Generalization (LoFG), novel hierarchical facade classes designed based on international urban modeling standards, ensuring compatibility with challenging real‑world classes and a uniform comparison of methods. Realizing the LoFG, we present the largest semantic 3D facade segmentation dataset to date, providing 601 million annotated points at five and 15 classes of LoFG2 and LoFG3, respectively. Moreover, we analyze the performance of baseline semantic segmentation methods on our introduced LoFG classes and data, complementing it with a discussion on the unresolved challenges for facade segmentation. We firmly believe that ZAHA shall facilitate further development of 3D facade semantic segmentation methods, enabling robust segmentation indispensable in creating urban digital twins.

Abstract:
The proliferation of 2D foundation models has sparked research into adapting them for open‑world 3D instance segmentation. Recent methods introduce a paradigm that leverages superpoints as geometric primitives and incorporates 2D multi‑view masks from the Segment Anything Model (SAM) as merging guidance, achieving outstanding zero‑shot instance segmentation results. However, the limited use of 3D priors restricts the segmentation performance. Previous methods calculate the 3D superpoints solely based on normals estimated from spatial coordinates, resulting in under‑segmentation for instances with similar geometry. In addition, the heavy reliance on SAM and hand‑crafted algorithms in 2D space leads to over‑segmentation due to SAM's inherent part‑level segmentation tendency. To address these issues, we propose SA3DIP, a novel method for Segmenting Any 3D Instances via exploiting potential 3D Priors. Specifically, on one hand, we generate complementary 3D primitives based on both geometric and textural priors, which reduces the initial errors that accumulate in subsequent procedures. On the other hand, we introduce supplemental constraints from the 3D space by using a 3D detector to guide a further merging process. Furthermore, we notice a considerable portion of low‑quality ground truth annotations in the ScanNetV2 benchmark, which affects fair evaluation. Thus, we present ScanNetV2‑INS with complete ground truth labels and supplement additional instances for 3D class‑agnostic instance segmentation. Experimental evaluations on various 2D‑3D datasets demonstrate the effectiveness and robustness of our approach. Our code and proposed ScanNetV2‑INS dataset are available HERE.

Abstract:
Semantic scene completion (SSC) is essential for achieving comprehensive perception in autonomous driving systems. However, existing SSC methods often overlook the high deployment costs in real‑world applications. Traditional architectures, such as 3D Convolutional Neural Networks (3D CNNs) and self‑attention mechanisms, face challenges in efficiently capturing long‑range dependencies within 3D voxel grids, limiting their effectiveness. To address these issues, we introduce MetaSSC, a novel meta‑learning‑based framework for SSC that leverages deformable convolution, large‑kernel attention, and the Mamba (D‑LKA‑M) model. Our approach begins with a voxel‑based semantic segmentation (SS) pretraining task, aimed at exploring the semantics and geometry of incomplete regions while acquiring transferable meta‑knowledge. Using simulated cooperative perception datasets, we supervise the perception training of a single vehicle using aggregated sensor data from multiple nearby connected autonomous vehicles (CAVs), generating richer and more comprehensive labels. This meta‑knowledge is then adapted to the target domain through a dual‑phase training strategy that does not add extra model parameters, enabling efficient deployment. To further enhance the model's capability in capturing long‑sequence relationships within 3D voxel grids, we integrate Mamba blocks with deformable convolution and large‑kernel attention into the backbone network. Extensive experiments demonstrate that MetaSSC achieves state‑of‑the‑art performance, significantly outperforming competing models while also reducing deployment costs.

Abstract:
Fibrotic Lung Disease (FLD) is a severe condition marked by lung stiffening and scarring, leading to respiratory decline. High‑resolution computed tomography (HRCT) is critical for diagnosing and monitoring FLD; however, fibrosis appears as irregular, diffuse patterns with unclear boundaries, leading to high inter‑observer variability and time‑intensive manual annotation. To tackle this challenge, we propose DiffSeg, a novel weakly supervised semantic segmentation (WSSS) method that uses image‑level annotations to generate pixel‑level fibrosis segmentation, reducing the need for fine‑grained manual labeling. Additionally, our DiffSeg incorporates a diffusion‑based generative model to synthesize HRCT images with different levels of fibrosis from healthy slices, enabling the generation of the fibrosis‑injected slices and their paired fibrosis location. Experiments indicate that our method significantly improves the accuracy of pseudo masks generated by existing WSSS methods, greatly reducing the complexity of manual labeling and enhancing the consistency of the generated masks.

Abstract:
This paper introduces a methodology for generating synthetic annotated data to address data scarcity in semantic segmentation tasks within the precision agriculture domain. Utilizing Denoising Diffusion Probabilistic Models (DDPMs) and Generative Adversarial Networks (GANs), we propose a dual diffusion model architecture for synthesizing realistic annotated agricultural data, without any human intervention. We employ super‑resolution to enhance the phenotypic characteristics of the synthesized images and their coherence with the corresponding generated masks. We showcase the utility of the proposed method for wheat head segmentation. The high quality of synthesized data underscores the effectiveness of the proposed methodology in generating image‑mask pairs. Furthermore, models trained on our generated data exhibit promising performance when tested on an external, diverse dataset of real wheat fields. The results show the efficacy of the proposed methodology for addressing data scarcity for semantic segmentation tasks. Moreover, the proposed approach can be readily adapted for various segmentation tasks in precision agriculture and beyond.

Abstract:
LiDAR Semantic Segmentation is a fundamental task in autonomous driving perception consisting of associating each LiDAR point to a semantic label. Fully‑supervised models have widely tackled this task, but they require labels for each scan, which either limits their domain or requires impractical amounts of expensive annotations. Camera images, which are generally recorded alongside LiDAR pointclouds, can be processed by the widely available 2D foundation models, which are generic and dataset‑agnostic. However, distilling knowledge from 2D data to improve LiDAR perception raises domain adaptation challenges. For example, the classical perspective projection suffers from the parallax effect produced by the position shift between both sensors at their respective capture times. We propose a Semi‑Supervised Learning setup to leverage unlabeled LiDAR pointclouds alongside distilled knowledge from the camera images. To self‑supervise our model on the unlabeled scans, we add an auxiliary NeRF head and cast rays from the camera viewpoint over the unlabeled voxel features. The NeRF head predicts densities and semantic logits at each sampled ray location which are used for rendering pixel semantics. Concurrently, we query the Segment‑Anything (SAM) foundation model with the camera image to generate a set of unlabeled generic masks. We fuse the masks with the rendered pixel semantics from LiDAR to produce pseudo‑labels that supervise the pixel predictions. During inference, we drop the NeRF head and run our model with only LiDAR. We show the effectiveness of our approach in three public LiDAR Semantic Segmentation benchmarks: nuScenes, SemanticKITTI and ScribbleKITTI.

Abstract:
Accurate and consistent mapping of urban and rural areas is crucial for sustainable development, spatial planning, and policy design. It is particularly important in simulating the complex interactions between human activities and natural resources. Existing global urban‑rural datasets such as GHSL‑SMOD, GHS Degree of Urbanisation, and GRUMP are often spatially coarse, methodologically inconsistent, and poorly adapted to heterogeneous regions such as Africa, which limits their usefulness for policy and research. Their coarse grids and rule‑based classification methods obscure small or informal settlements and produce inconsistencies between countries. In this study, we develop a DeepLabV3‑based deep learning framework that integrates multi‑source data, including Landsat‑8 imagery, VIIRS nighttime lights, ESRI Land Use Land Cover (LULC), and GHS‑SMOD, to produce a 10 m resolution urban‑rural map across the African continent from 2016 to 2022. The use of Landsat data also highlights the potential to extend this mapping approach historically, reaching back to the 1990s. The model employs semantic segmentation to capture fine‑scale settlement morphology, and its outputs are validated using the Demographic and Health Surveys (DHS) dataset, which provides independent, survey‑based urban‑rural labels. The model achieves an overall accuracy of 65% and a Kappa coefficient of 0.47 at the continental scale, outperforming existing global products such as SMOD. The resulting High‑Resolution Urban‑Rural (HUR) dataset provides an open and reproducible framework for mapping human settlements, enabling more context‑aware analyses of Africa's rapidly evolving settlement systems. We release a continent‑wide urban‑rural dataset covering the period from 2016 to 2022, offering a new source for high‑resolution settlement mapping in Africa.
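
The abstract reports both an overall accuracy and a Kappa coefficient; the sketch below shows the standard Cohen's kappa computation from a confusion matrix, which corrects observed agreement for the agreement expected by chance. The function name and the two-class counts are hypothetical illustrations only and are not the study's validation data.

```python
def cohens_kappa(cm):
    """Cohen's kappa for a square confusion matrix cm[true][predicted]."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    # Observed agreement: fraction of samples on the diagonal.
    p_observed = sum(cm[i][i] for i in range(n)) / total
    # Chance agreement: product of row and column marginals, summed over classes.
    p_expected = sum(
        sum(cm[i]) * sum(cm[r][i] for r in range(n)) for i in range(n)
    ) / (total ** 2)
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical urban/rural counts: rows are survey labels, columns are predictions.
toy_cm = [
    [45, 5],    # urban samples: 45 predicted urban, 5 predicted rural
    [15, 35],   # rural samples: 15 predicted urban, 35 predicted rural
]
kappa = cohens_kappa(toy_cm)  # observed agreement 0.8, chance agreement 0.5
```

In this toy case the accuracy is 0.8 while kappa is roughly 0.6, showing how kappa discounts the portion of agreement that a chance-level classifier with the same marginals would reach.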

Abstract:
Class‑incremental semantic segmentation (CSS) requires that a model learn to segment new classes without forgetting how to segment previous ones: this is typically achieved by distilling the current knowledge and incorporating the latest data. However, bypassing iterative distillation by directly transferring outputs of initial classes to the current learning task is not supported in existing class‑specific CSS methods. Via Softmax, they enforce dependency between classes and adjust the output distribution at each learning step, resulting in a large probability distribution gap between initial and current tasks. We introduce a simple, yet effective Class Independent Transformation (CIT) that converts the outputs of existing semantic segmentation models into class‑independent forms with negligible cost or performance loss. By utilizing class‑independent predictions facilitated by CIT, we establish an accumulative distillation framework, ensuring equitable incorporation of all class information. We conduct extensive experiments on various segmentation architectures, including DeepLabV3, Mask2Former, and SegViTv2. Results from these experiments show minimal task forgetting across different datasets, with less than 5% for ADE20K in the most challenging 11 task configurations and less than 1% across all configurations for the PASCAL VOC 2012 dataset.

Abstract:
In this study, 0.5 m high‑resolution satellite data over an Indian urban region were used to demonstrate the applicability of deep learning models over Ahmedabad, India. A YOLOv7 instance segmentation model was trained on a well‑curated tree canopy dataset (6500 images) in order to carry out change detection. During training, the bounding box regression loss, the mask regression loss, and mean average precision (mAP) were used to evaluate the model, and stochastic gradient descent was used to optimize it. After 500 epochs, mAP values of 0.715 and 0.699 were obtained for individual tree detection and tree canopy mask segmentation, respectively. By further tuning the model's hyperparameters, a maximum tree detection accuracy of 80% was obtained, with a false segmentation rate of 2% on the data.

Abstract:
Objects in the real world rarely occur in isolation and exhibit typical arrangements governed by their independent utility and their expected interaction with humans and other objects in the context. For example, a chair is expected near a table, and a computer is expected on top. Humans use this spatial context and relative placement as an important cue for visual recognition in case of ambiguities. Similar to humans, DNNs exploit contextual information from data to learn representations. Our research focuses on harnessing the contextual aspects of visual data to optimize data annotation and enhance the training of deep networks. Our contributions can be summarized as follows: (1) We introduce the notion of contextual diversity for active learning (CDAL) and show its applicability in three different visual tasks: semantic segmentation, object detection, and image classification; (2) We propose a data repair algorithm to curate contextually fair data to reduce model bias, enabling the model to detect objects out of their obvious context; (3) We propose class‑based annotation, where contextually relevant classes are selected that are complementary for model training under domain shift. Understanding the importance of well‑curated data, we also emphasize the necessity of involving humans in the loop to achieve accurate annotations and to develop novel interaction strategies that allow humans to serve as fact‑checkers. In line with this, we are working on developing an image retrieval system for wildlife camera trap images and a reliable warning system for poor‑quality rural roads. For large‑scale annotation, we are employing a strategic combination of human expertise and zero‑shot models, while also integrating human input at various stages for continuous feedback.

Abstract:
Current semantic segmentation models typically require a substantial amount of manually annotated data, a process that is both time‑consuming and resource‑intensive. Alternatively, leveraging advanced text‑to‑image models such as Midjourney and Stable Diffusion has emerged as an efficient strategy, enabling the automatic generation of synthetic data in place of manual annotations. However, previous methods have been limited to generating single‑instance images, as the generation of multiple instances with Stable Diffusion has proven unstable. To address this limitation and expand the scope and diversity of synthetic datasets, we propose a framework Free‑Mask that combines a Diffusion Model for segmentation with advanced image editing capabilities, allowing for the integration of multiple objects into images via text‑to‑image models. Our method facilitates the creation of highly realistic datasets that closely emulate open‑world environments while generating accurate segmentation masks. It reduces the labor associated with manual annotation and also ensures precise mask generation. Experimental results demonstrate that synthetic data generated by Free‑Mask enables segmentation models to outperform those trained on real data, especially in zero‑shot settings. Notably, Free‑Mask achieves new state‑of‑the‑art results on previously unseen classes in the VOC 2012 benchmark.

Abstract:
Localization is one of the most crucial tasks for Unmanned Aerial Vehicle systems (UAVs) directly impacting overall performance, which can be achieved with various sensors and applied to numerous tasks related to search and rescue operations, object tracking, construction, etc. However, due to the negative effects of challenging environments, UAVs may lose signals for localization. In this paper, we present an effective path‑planning system leveraging semantic segmentation information to navigate around texture‑less and problematic areas like lakes, oceans, and high‑rise buildings using a monocular camera. We introduce a real‑time semantic segmentation architecture and a novel keyframe decision pipeline to optimize image inputs based on pixel distribution, reducing processing time. A hierarchical planner based on the Dynamic Window Approach (DWA) algorithm, integrated with a cost map, is designed to facilitate efficient path planning. The system is implemented in a photo‑realistic simulation environment using Unity, aligning with segmentation model parameters. Comprehensive qualitative and quantitative evaluations validate the effectiveness of our approach, showing significant improvements in the reliability and efficiency of UAV localization in challenging environments.

Abstract:
Recently, transformer‑based techniques incorporating superpoints have become prevalent in 3D instance segmentation. However, they often encounter an over‑segmentation problem, especially noticeable with large objects. Additionally, unreliable mask predictions stemming from superpoint mask prediction further compound this issue. To address these challenges, we propose a novel framework called MSTA3D. It leverages multi‑scale feature representation and introduces a twin‑attention mechanism to effectively capture them. Furthermore, MSTA3D integrates a box query with a box regularizer, offering a complementary spatial constraint alongside semantic queries. Experimental evaluations on ScanNetV2, ScanNet200 and S3DIS datasets demonstrate that our approach surpasses state‑of‑the‑art 3D instance segmentation methods.

Abstract:
Semantic segmentation is an important branch of image processing and computer vision. With the popularity of deep learning, various convolutional neural networks have been proposed for pixel‑level classification and segmentation tasks. In practical scenarios, however, imaging angles are often arbitrary, encompassing instances such as water body images from remote sensing and capillary and polyp images in the medical domain, where prior orientation information is typically unavailable to guide these networks to extract more effective features. In this case, learning features from objects with diverse orientation information poses a significant challenge, as the majority of CNN‑based semantic segmentation networks lack rotation equivariance to resist the disturbance from orientation information. To address this challenge, this paper first constructs a universal convolution‑group framework aimed at more fully utilizing orientation information and equipping the network with rotation equivariance. Subsequently, we mathematically design a padding‑based rotation equivariant convolution mode (PreCM), which is not only applicable to multi‑scale images and convolutional kernels but can also serve as a replacement component for various types of convolutions, such as dilated convolutions, transposed convolutions, and asymmetric convolutions. To quantitatively assess the impact of image rotation in semantic segmentation tasks, we also propose a new evaluation metric, Rotation Difference (RD). Replacement experiments on six existing semantic segmentation networks across three datasets show that the average Intersection over Union (IoU) of their PreCM‑based versions improves by 6.91%, 10.63%, 4.53%, 5.93%, 7.48%, and 8.33%, respectively, compared to their original versions under random angle rotation. The average RD values decrease by 3.58%, 4.56%, 3.47%, 3.66%, 3.47%, and 3.43%, respectively.

Abstract:
Question‑answering (QA) is an important application of Information Retrieval (IR) and language models, and the latest trend is toward pre‑trained large neural networks with embedding parameters. Augmenting QA performances with these LLMs requires intensive computational resources for fine‑tuning. We propose an innovative approach to improve QA task performances by integrating optimized vector retrievals and instruction methodologies. Based on retrieval augmentation, the process involves document embedding, vector retrieval, and context construction for optimal QA results. We experiment with different combinations of text segmentation techniques and similarity functions, and analyze their impacts on QA performances. Results show that the model with a small chunk size of 100 without any overlap of the chunks achieves the best result and outperforms the models based on semantic segmentation using sentences. We discuss related QA examples and offer insight into how model performances are improved within the two‑stage framework.
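
The retrieval pipeline described above (fixed‑size chunks with no overlap, followed by similarity‑based retrieval) can be sketched as follows. This is a minimal illustration under stated assumptions: the abstract does not say whether the chunk size of 100 counts characters or tokens (characters are assumed here), and a bag‑of‑words cosine similarity stands in for the learned document embeddings. The function names are hypothetical.

```python
import math
from collections import Counter


def chunk_text(text, size=100, overlap=0):
    """Split text into fixed-size character chunks; overlap=0 means disjoint chunks."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks):
    """Return the chunk most similar to the query (stand-in for vector retrieval)."""
    query_vec = Counter(query.lower().split())
    return max(chunks, key=lambda c: cosine(query_vec, Counter(c.lower().split())))


# Hypothetical 250-character document, split into chunks of at most 100 characters.
document = "x" * 250
chunks = chunk_text(document, size=100, overlap=0)
best = retrieve("vector retrieval",
                ["cats sleep all day", "vector retrieval finds context"])
```

The retrieved chunk would then be placed in the prompt as context for the QA model.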

Abstract:
Recent video semantic segmentation (VSS) methods have demonstrated promising results in well‑lit environments. However, their performance significantly drops in low‑light scenarios due to limited visibility and reduced contextual details. In addition, unfavorable low‑light conditions make it harder to incorporate temporal consistency across video frames and thus, lead to video flickering effects. Compared with conventional cameras, event cameras can capture motion dynamics, filter out temporal‑redundant information, and are robust to lighting conditions. To this end, we propose EVSNet, a lightweight framework that leverages event modality to guide the learning of a unified illumination‑invariant representation. Specifically, we leverage a Motion Extraction Module to extract short‑term and long‑term temporal motions from event modality and a Motion Fusion Module to integrate image features and motion features adaptively. Furthermore, we use a Temporal Decoder to exploit video contexts and generate segmentation predictions. Such designs in EVSNet result in a lightweight architecture while achieving SOTA performance. Experimental results on 3 large‑scale datasets demonstrate our proposed EVSNet outperforms SOTA methods with up to 11x higher parameter efficiency.

Abstract:
This study addresses the challenge of classifying cell shapes from noisy contours, such as those obtained through cell instance segmentation of histological images. We assess the performance of various features for shape classification, including Elliptical Fourier Descriptors, curvature features, and lower dimensional representations. Using an annotated synthetic dataset of noisy contours, we identify the most suitable shape descriptors and apply them to a set of real images for qualitative analysis. Our aim is to provide a comprehensive evaluation of descriptors for classifying cell shapes, which can support cell type identification and tissue characterization, critical tasks in both biological research and histopathological assessments.

Abstract:
In the context of firefighting and rescue operations, a cross‑modal semantic segmentation model based on a single‑chip millimeter‑wave (mmWave) radar for indoor environmental perception is proposed and discussed. To efficiently obtain high‑quality labels, an automatic label generation method utilizing LiDAR point clouds and occupancy grid maps is introduced. The proposed segmentation model is based on U‑Net. A spatial attention module is incorporated, which enhances the performance of the model. The results demonstrate that cross‑modal semantic segmentation provides a more intuitive and accurate representation of indoor environments. Unlike traditional methods, the model's segmentation performance is minimally affected by azimuth. Although performance declines with increasing distance, this can be mitigated by a well‑designed model. Additionally, it was found that using raw ADC data as input is ineffective; compared to RA tensors, RD tensors are more suitable for the proposed model.

Abstract:
Underwater instance segmentation is a fundamental and critical step in various underwater vision tasks. However, the decline in image quality caused by complex underwater environments presents significant challenges to existing segmentation models. While the state‑of‑the‑art USIS‑SAM model has demonstrated impressive performance, it struggles to effectively adapt to feature variations across different channels in addressing issues such as light attenuation, color distortion, and complex backgrounds. This limitation hampers its segmentation performance in challenging underwater scenarios. To address these issues, we propose the MarineVision Adapter (MV‑Adapter). This module introduces an adaptive channel attention mechanism that enables the model to dynamically adjust the feature weights of each channel based on the characteristics of underwater images. By adaptively weighting features, the model can effectively handle challenges such as light attenuation, color shifts, and complex backgrounds. Experimental results show that integrating the MV‑Adapter module into the USIS‑SAM network architecture further improves the model's overall performance, especially in high‑precision segmentation tasks. On the USIS10K dataset, the module achieves improvements in key metrics such as mAP, AP50, and AP75 compared to competitive baseline models.

Abstract:
Utilizing patch‑based transformers for unstructured geometric data such as polygon meshes presents significant challenges, primarily due to the absence of a canonical ordering and variations in input sizes. Prior approaches to handling 3D meshes and point clouds have either relied on computationally intensive node‑level tokens for large objects or resorted to resampling to standardize patch size. Moreover, these methods generally lack a geometry‑aware, stable Structural Embedding (SE), often depending on simplistic absolute SEs such as 3D coordinates, which compromise isometry invariance essential for tasks like semantic segmentation. In our study, we meticulously examine the various components of a geometry‑aware 3D mesh transformer, from tokenization to structural encoding, assessing the contribution of each. Initially, we introduce a spectral‑preserving tokenization rooted in algebraic multigrid methods. Subsequently, we detail an approach for embedding features at the patch level, accommodating patches with variable node counts. Through comparative analyses against a baseline model employing simple point‑wise Multi‑Layer Perceptrons (MLP), our research highlights critical insights: 1) the importance of structural and positional embeddings facilitated by heat diffusion in general 3D mesh transformers; 2) the effectiveness of novel components such as geodesic masking and feature interaction via cross‑attention in enhancing learning; and 3) the superior performance and efficiency of our proposed methods in challenging segmentation and classification tasks.

Abstract:
Automated waste recycling aims to efficiently separate the recyclable objects from the waste by employing vision‑based systems. However, the presence of varying shaped objects having different material types makes it a challenging problem, especially in cluttered environments. Existing segmentation methods perform reasonably on many semantic segmentation datasets by employing multi‑contextual representations, however, their performance is degraded when utilized for waste object segmentation in cluttered scenarios. In addition, plastic objects further increase the complexity of the problem due to their translucent nature. To address these limitations, we introduce an efficacious segmentation network, named COSNet, that uses boundary cues along with multi‑contextual information to accurately segment the objects in cluttered scenes. COSNet introduces novel components including feature sharpening block (FSB) and boundary enhancement module (BEM) for enhancing the features and highlighting the boundary information of irregular waste objects in cluttered environment. Extensive experiments on three challenging datasets including ZeroWaste‑f, SpectralWaste, and ADE20K demonstrate the effectiveness of the proposed method. Our COSNet achieves a significant gain of 1.8% on ZeroWaste‑f and 2.1% on SpectralWaste datasets respectively in terms of mIoU metric.

Abstract:
We present REM, a framework for segmenting a wide range of concepts in video that can be described through natural language. Our method leverages the universal visual‑language mapping learned by video diffusion models on Internet‑scale data by fine‑tuning them on small‑scale Referring Object Segmentation datasets. Our key insight is to preserve the entirety of the generative model's architecture by shifting its objective from predicting noise to predicting mask latents. The resulting model can accurately segment rare and unseen objects, despite only being trained on a limited set of categories. Additionally, it can effortlessly generalize to non‑object dynamic concepts, such as smoke or raindrops, as demonstrated in our new benchmark for Referring Video Process Segmentation (Ref‑VPS). REM performs on par with the state‑of‑the‑art on in‑domain datasets, like Ref‑DAVIS, while outperforming them by up to 12 IoU points out‑of‑domain, leveraging the power of generative pre‑training. We also show that advancements in video generation directly improve segmentation.

Abstract:
Current cardiac cine magnetic resonance image (cMR) studies focus on the end diastole (ED) and end systole (ES) phases, while ignoring the abundant temporal information in the whole image sequence. This is because whole sequence segmentation is currently a tedious process and inaccurate. Conventional whole sequence segmentation approaches first estimate the motion field between frames, which is then used to propagate the mask along the temporal axis. However, the mask propagation results could be prone to error, especially for the basal and apex slices, where through‑plane motion leads to significant morphology and structural change during the cardiac cycle. Inspired by recent advances in video object segmentation (VOS), based on spatio‑temporal memory (STM) networks, we propose a continuous STM (CSTM) network for semi‑supervised whole heart and whole sequence cMR segmentation. Our CSTM network takes full advantage of the spatial, scale, temporal and through‑plane continuity prior of the underlying heart anatomy structures, to achieve accurate and fast 4D segmentation. Results of extensive experiments across multiple cMR datasets show that our method can improve the 4D cMR segmentation performance, especially for the hard‑to‑segment regions.

Abstract:
Recent self‑supervised clustering‑based pre‑training techniques like DINO and Cribo have shown impressive results for downstream detection and segmentation tasks. However, real‑world applications such as autonomous driving face challenges with imbalanced object class and size distributions and complex scene geometries. In this paper, we propose S3PT, a novel scene semantics and structure guided clustering to provide more scene‑consistent objectives for self‑supervised training. Specifically, our contributions are threefold: First, we incorporate semantic distribution consistent clustering to encourage better representation of rare classes such as motorcycles or animals. Second, we introduce object diversity consistent spatial clustering to handle imbalanced and diverse object sizes, ranging from large background areas to small objects such as pedestrians and traffic signs. Third, we propose a depth‑guided spatial clustering to regularize learning based on geometric information of the scene, thus further refining region separation on the feature level. Our learned representations significantly improve performance in downstream semantic segmentation and 3D object detection tasks on the nuScenes, nuImages, and Cityscapes datasets and show promising domain translation properties.

Abstract:
Computer vision researchers have extensively worked on fundamental infrared visual recognition for the past few decades. Among various approaches, deep learning has emerged as the most promising candidate. However, Infrared Small Object Segmentation (ISOS) remains a major focus due to several challenges including: 1) the lack of effective utilization of local contrast and global contextual information; 2) the potential loss of small objects in deep models; and 3) the difficulty of capturing fine‑grained details while ignoring noise. To address these challenges, we propose a modified U‑Net architecture, named SFA‑UNet, by combining Scharr Convolution (SC) and Fast Fourier Convolution (FFC) in addition to vertical and horizontal Attention gates (AG) into UNet. SFA‑UNet utilizes double convolution layers with the addition of SC and FFC in its encoder and decoder layers. SC helps to learn the foreground‑to‑background contrast information, whereas FFC provides multi‑scale contextual information while mitigating the small objects vanishing problem. Additionally, the introduction of vertical AGs in encoder layers enhances the model's focus on the targeted object by ignoring irrelevant regions. We evaluated the proposed approach on the publicly available SIRST and IRSTD datasets and achieved superior performance, with an average gain of 0.75% (variance 0.025) across all combined metrics over multiple runs, compared to the existing state‑of‑the‑art methods.
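
To make the contrast cue behind a Scharr filter concrete, the sketch below applies the standard fixed 3x3 horizontal Scharr kernel to a tiny image with a vertical edge. It is an illustration only: how SFA‑UNet integrates Scharr Convolution into its layers is not specified in the abstract, and the helper name and the toy image are assumptions.

```python
# Standard 3x3 Scharr kernel for horizontal intensity gradients.
SCHARR_X = [[-3, 0, 3],
            [-10, 0, 10],
            [-3, 0, 3]]


def correlate_valid(image, kernel):
    """2D cross-correlation without padding (output shrinks by kernel size - 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(len(image) - kh + 1):
        row = []
        for c in range(len(image[0]) - kw + 1):
            row.append(sum(kernel[i][j] * image[r + i][c + j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out


# Toy 4x4 image: dark left half, bright right half (a vertical edge).
edge_image = [[0, 0, 1, 1] for _ in range(4)]
flat_image = [[1, 1, 1, 1] for _ in range(4)]

edge_response = correlate_valid(edge_image, SCHARR_X)   # strong response at the edge
flat_response = correlate_valid(flat_image, SCHARR_X)   # zero response on flat regions
```

The kernel weights sum to zero, so uniform background produces no response while local foreground‑to‑background contrast does.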

Abstract:
Contemporary state‑of‑the‑art video object segmentation (VOS) models compare incoming unannotated images to a history of image‑mask relations via affinity or cross‑attention to predict object masks. We refer to the internal memory state of the initial image‑mask pair and past image‑masks as a working memory buffer. While the current state‑of‑the‑art models perform very well on clean video data, their reliance on a working memory of previous frames leaves room for error. Affinity‑based algorithms include the inductive bias that there is temporal continuity between consecutive frames. To account for inconsistent camera views of the desired object, working memory models need an algorithmic modification that regulates the memory updates and avoids writing irrelevant frames into working memory. A simple algorithmic change is proposed that can be applied to any existing working memory‑based VOS model to improve performance on inconsistent views, such as sudden camera cuts, frame interjections, and extreme context changes. The resulting model performances show significant improvement on video data with these frame interjections over the same model without the algorithmic addition. Our contribution is a simple decision function that determines whether working memory should be updated based on the detection of sudden, extreme changes and the assumption that the object is no longer in frame. By implementing algorithmic changes such as this, we can increase the real‑world applicability of current VOS models.
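
A decision function of the kind described above could look like the following sketch. The abstract does not give the actual criterion, so everything here is an assumption made for illustration: the frame‑difference score, the thresholds, and the function names are hypothetical.

```python
def mean_abs_diff(frame_a, frame_b):
    """Mean absolute intensity difference between two equally sized flat frames."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)


def should_update_memory(prev_frame, frame, mask_area,
                         diff_threshold=0.5, min_mask_area=1):
    """Skip the memory write when a sudden change coincides with a missing object.

    prev_frame, frame: flat lists of intensities in [0, 1] (hypothetical format).
    mask_area: number of pixels in the predicted object mask for `frame`.
    """
    sudden_change = mean_abs_diff(prev_frame, frame) > diff_threshold
    object_missing = mask_area < min_mask_area
    return not (sudden_change and object_missing)


previous = [0.1, 0.1, 0.1, 0.1]
similar = [0.1, 0.2, 0.1, 0.1]      # ordinary next frame
camera_cut = [0.9, 0.9, 0.9, 0.9]   # abrupt scene change

update_on_similar = should_update_memory(previous, similar, mask_area=12)
update_on_cut = should_update_memory(previous, camera_cut, mask_area=0)
```

Because the gate sits in front of the memory write, it can wrap an existing model's update step without changing the model itself.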

Abstract:
Cross‑domain few‑shot segmentation (CD‑FSS) is proposed to first pre‑train the model on a large‑scale source‑domain dataset, and then transfer the model to data‑scarce target‑domain datasets for pixel‑level segmentation. The significant domain gap between the source and target datasets leads to a sharp decline in the performance of existing few‑shot segmentation (FSS) methods in cross‑domain scenarios. In this work, we discover an intriguing phenomenon: simply filtering different frequency components for target domains can lead to a significant performance improvement, sometimes even as high as 14% mIoU. Then, we delve into this phenomenon for an interpretation, and find such improvements stem from the reduced inter‑channel correlation in feature maps, which benefits CD‑FSS with enhanced robustness against domain gaps and larger activated regions for segmentation. Based on this, we propose a lightweight frequency masker, which further reduces channel correlations by an Amplitude‑Phase Masker (APM) module and an Adaptive Channel Phase Attention (ACPA) module. Notably, APM introduces only 0.01% additional parameters but improves the average performance by over 10%, and ACPA imports only 2.5% parameters but further improves the performance by over 1.5%, which significantly surpasses the state‑of‑the‑art CD‑FSS methods.
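
The idea of filtering frequency components can be illustrated on a one‑dimensional signal: transform to the frequency domain, zero out selected components, and transform back. This is only a sketch of generic frequency masking with a plain DFT. The paper operates on feature maps, and its APM and ACPA modules are not reproduced here; the function names and signals are assumptions.

```python
import cmath


def dft(signal):
    """Discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def low_pass(signal, keep):
    """Keep frequency bins whose index distance from DC is at most `keep`."""
    spectrum = dft(signal)
    n = len(spectrum)
    masked = [spectrum[k] if min(k, n - k) <= keep else 0 for k in range(n)]
    return idft(masked)


constant = [2.0] * 8
alternating = [1.0, -1.0] * 4          # energy only at the highest frequency

kept_constant = low_pass(constant, keep=0)         # DC survives the mask
removed_alternating = low_pass(alternating, keep=0)  # high frequency is removed
```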

Abstract:
Hyperspectral Imaging (HSI) is known for its advantages over traditional RGB imaging in remote sensing, agriculture, and medicine. Recently, it has gained attention for enhancing Advanced Driving Assistance Systems (ADAS) perception. Several HSI datasets such as HyKo, HSI‑Drive, HSI‑Road, and Hyperspectral City have been made available. However, a comprehensive evaluation of semantic segmentation models (SSM) using these datasets is lacking. To address this gap, we evaluated the available annotated HSI datasets on four deep learning‑based baseline SSMs: DeepLab v3+, HRNet, PSPNet, and U‑Net, along with its two variants: Coordinate Attention (UNet‑CA) and Convolutional Block‑Attention Module (UNet‑CBAM). The original model architectures were adapted to handle the varying spatial and spectral dimensions of the datasets. These baseline SSMs were trained using a class‑weighted loss function for individual HSI datasets and evaluated using mean‑based metrics such as intersection over union (IoU), recall, precision, F1 score, specificity, and accuracy. Our results indicate that UNet‑CBAM, which extracts channel‑wise features, outperforms other SSMs and shows potential to leverage spectral information for enhanced semantic segmentation. This study establishes a baseline SSM benchmark on available annotated datasets for future evaluation of HSI‑based ADAS perception. However, limitations of current HSI datasets, such as limited dataset size, high class imbalance, and lack of fine‑grained annotations, remain significant constraints for developing robust SSMs for ADAS applications.
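
The mean‑based metrics listed above are typically computed per class from a confusion matrix and then averaged. The sketch below shows this for IoU; the function name and the counts are illustrative assumptions rather than the benchmark's code or results.

```python
def per_class_iou(confusion):
    """Per-class IoU = TP / (TP + FP + FN).

    confusion[i][j] counts pixels of reference class i predicted as class j.
    """
    n = len(confusion)
    ious = []
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp
        fp = sum(confusion[i][k] for i in range(n)) - tp
        denom = tp + fp + fn
        ious.append(tp / denom if denom else float("nan"))
    return ious


# Hypothetical two-class pixel counts.
pixel_confusion = [[6, 2],
                   [2, 10]]
ious = per_class_iou(pixel_confusion)
mean_iou = sum(ious) / len(ious)
```

Averaging over classes rather than pixels keeps rare classes from being swamped by frequent ones, which matters under the class imbalance the abstract mentions.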

Abstract:
Domain adaptation has been extensively investigated in computer vision but still requires access to target data at the training time, which might be difficult to obtain in some uncommon conditions. In this paper, we present a new framework for domain adaptation relying on a single Vision‑Language (VL) latent embedding instead of full target data. First, leveraging a contrastive language‑image pre‑training model (CLIP), we propose prompt/photo‑driven instance normalization (PIN). PIN is a feature augmentation method that mines multiple visual styles using a single target VL latent embedding, by optimizing affine transformations of low‑level source features. The VL embedding can come from a language prompt describing the target domain, a partially optimized language prompt, or a single unlabeled target image. Second, we show that these mined styles (i.e., augmentations) can be used for zero‑shot (i.e., target‑free) and one‑shot unsupervised domain adaptation. Experiments on semantic segmentation demonstrate the effectiveness of the proposed method, which outperforms relevant baselines in the zero‑shot and one‑shot settings.
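
The affine transformation at the core of PIN can be sketched as an instance‑normalization step followed by re‑scaling and shifting with target statistics. In the method described above those statistics are optimized against a CLIP embedding; that optimization is omitted here, and the target mean and standard deviation below are hypothetical placeholders.

```python
import math


def restyle_channel(channel, target_mean, target_std, eps=1e-5):
    """Normalize one feature channel, then apply an affine (scale, shift) transform."""
    n = len(channel)
    mean = sum(channel) / n
    var = sum((v - mean) ** 2 for v in channel) / n
    std = math.sqrt(var + eps)
    return [target_std * (v - mean) / std + target_mean for v in channel]


# One flattened source feature channel and made-up target style statistics.
source_channel = [1.0, 2.0, 3.0, 4.0]
styled = restyle_channel(source_channel, target_mean=10.0, target_std=2.0)

styled_mean = sum(styled) / len(styled)
styled_std = math.sqrt(sum((v - styled_mean) ** 2 for v in styled) / len(styled))
```

Only the channel statistics change; the spatial arrangement of the feature values is preserved.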

Abstract:
Deep neural networks (DNNs) have shown exceptional performance when trained on well‑illuminated images captured by Electro‑Optical (EO) cameras, which provide rich texture details. However, in critical applications like aerial perception, it is essential for DNNs to maintain consistent reliability across all conditions, including low‑light scenarios where EO cameras often struggle to capture sufficient detail. Additionally, UAV‑based aerial object detection faces significant challenges due to scale variability from varying altitudes and slant angles, adding another layer of complexity. Existing methods typically address only illumination changes or style variations as domain shifts, but in aerial perception, correlation shifts also impact DNN performance. In this paper, we introduce the IndraEye dataset, a multi‑sensor (EO‑IR) dataset designed for various tasks. It includes 5,612 images with 145,666 instances, encompassing multiple viewing angles, altitudes, seven backgrounds, and different times of the day across the Indian subcontinent. The dataset opens up several research opportunities, such as multimodal learning, domain adaptation for object detection and segmentation, and exploration of sensor‑specific strengths and weaknesses. IndraEye aims to advance the field by supporting the development of more robust and accurate aerial perception systems, particularly in challenging conditions. IndraEye dataset is benchmarked with object detection and semantic segmentation tasks. Dataset and source codes are available at https://bit.ly/indraeye.

Abstract:
In volcano monitoring, effective recognition of seismic events is essential for understanding volcanic activity and raising timely warning alerts. Traditional methods rely on manual analysis, which can be subjective and labor‑intensive. Furthermore, current automatic approaches often tackle detection and classification separately, mostly rely on single‑station information, and generally require tailored preprocessing and representations to perform predictions. These limitations often hinder their application to real‑time monitoring and utilization across different volcano conditions. This study introduces a novel approach that utilizes Semantic Segmentation models to automate seismic event recognition by applying a straightforward transformation of multi‑channel 1D signals into 2D representations, enabling their use as images. Our framework employs a data‑driven, end‑to‑end design that integrates multi‑station seismic data with minimal preprocessing, performing both detection and classification simultaneously for five seismic event classes. We evaluated four state‑of‑the‑art segmentation models (UNet, UNet++, DeepLabV3+ and SwinUNet) on approximately 25,000 seismic events recorded at four different Chilean volcanoes: Nevados del Chillán Volcanic Complex, Laguna del Maule, Villarrica and Puyehue‑Cordón Caulle. Among these models, the UNet architecture was identified as the most effective model, achieving mean F1 and Intersection over Union (IoU) scores of up to 0.91 and 0.88, respectively, and demonstrating superior noise robustness and model flexibility to unseen volcano datasets.
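
One simple way to turn multi‑channel 1D signals into a 2D representation is to stack the channels as image rows and rescale each row to an 8‑bit intensity range. The abstract does not specify the exact transformation used, so this is an assumed, illustrative variant; the function name and sample values are hypothetical.

```python
def signals_to_image(channels):
    """Stack equal-length 1D channels as rows and rescale each row to 0..255."""
    length = len(channels[0])
    if any(len(ch) != length for ch in channels):
        raise ValueError("all channels must have the same number of samples")
    image = []
    for ch in channels:
        lo, hi = min(ch), max(ch)
        span = hi - lo
        # A constant channel carries no contrast, so it maps to a row of zeros.
        image.append([round(255 * (v - lo) / span) if span else 0 for v in ch])
    return image


# Two hypothetical station channels with three samples each.
station_signals = [[0.0, 1.0, 5.0],
                   [-1.0, 0.0, 4.0]]
image = signals_to_image(station_signals)
```

With time along the horizontal axis, a segmentation model can then label each column range with an event class.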

Abstract:
Test‑time prompt tuning, which learns prompts online with unlabelled test samples during the inference stage, has demonstrated great potential by learning effective prompts on‑the‑fly without requiring any task‑specific annotations. However, its performance often degrades clearly along the tuning process when the prompts are continuously updated with the test data flow, and the degradation becomes more severe when the domain of test samples changes continuously. We propose HisTPT, a Historical Test‑time Prompt Tuning technique that memorizes the useful knowledge of the learnt test samples and enables robust test‑time prompt tuning with the memorized knowledge. HisTPT introduces three types of knowledge banks, namely, local knowledge bank, hard‑sample knowledge bank, and global knowledge bank, each of which works with different mechanisms for effective knowledge memorization and test‑time prompt optimization. In addition, HisTPT features an adaptive knowledge retrieval mechanism that regularizes the prediction of each test sample by adaptively retrieving the memorized knowledge. Extensive experiments show that HisTPT achieves superior prompt tuning performance consistently while handling different visual recognition tasks (e.g., image classification, semantic segmentation, and object detection) and test samples from continuously changing domains.

Abstract:
Coronary artery disease poses a significant global health challenge, often necessitating percutaneous coronary intervention (PCI) with stent implantation. Assessing stent apposition holds pivotal importance in averting and identifying PCI complications that lead to in‑stent restenosis. Here we propose a novel three‑dimensional (3D) distance‑color‑coded assessment (DccA) for PCI stent apposition via deep‑learning‑based 3D multi‑object segmentation in intravascular optical coherence tomography (IV‑OCT). Our proposed 3D DccA accurately segments 3D vessel lumens and stents in IV‑OCT images, using a spatial matching network and dual‑layer training with style transfer. It quantifies and maps stent‑lumen distances into a 3D color space, facilitating 3D visual assessment of PCI stent apposition. Achieving over 95% segmentation precision, our proposed DccA enhances clinical evaluation of PCI stent deployment and supports personalized treatment planning.

Abstract:
While the pretraining of Foundation Models (FMs) for remote sensing (RS) imagery is on the rise, models remain restricted to a few hundred million parameters. Scaling models to billions of parameters has been shown to yield unprecedented benefits including emergent abilities, but requires data scaling and computing resources typically not available outside industry R&D labs. In this work, we pair high‑performance computing resources, including the Frontier supercomputer, America's first exascale system, with high‑resolution optical RS data to pretrain billion‑scale FMs. Our study assesses the performance of different pretrained variants of vision Transformers across image classification, semantic segmentation and object detection benchmarks, which highlights the importance of data scaling for effective model scaling. Moreover, we discuss the construction of a novel TIU pretraining dataset and model initialization, with data and pretrained models intended for public release. By discussing technical challenges and details often lacking in the related literature, this work is intended to offer best practices to the geospatial community toward efficient training and benchmarking of larger FMs.

Abstract:
This study conducted a comprehensive performance evaluation of YOLO11 (or YOLOv11) and YOLOv8, the latest in the "You Only Look Once" (YOLO) series, focusing on their instance segmentation capabilities for immature green apples in orchard environments. YOLO11n‑seg achieved the highest mask precision across all categories with a notable score of 0.831, highlighting its effectiveness in fruit detection. YOLO11m‑seg and YOLO11l‑seg excelled in non‑occluded and occluded fruitlet segmentation with scores of 0.851 and 0.829, respectively. Additionally, YOLOv11x‑seg led in mask recall for all categories, achieving a score of 0.815, with YOLO11m‑seg performing best for non‑occluded immature green fruitlets at 0.858 and YOLOv8x‑seg leading the occluded category with 0.800. In terms of mean average precision at a 50% intersection over union (mAP@50), YOLOv11m‑seg consistently outperformed the other models, registering the highest scores for both box and mask segmentation, at 0.876 and 0.860 for the "All" class and 0.908 and 0.909 for non‑occluded immature fruitlets, respectively. YOLO11l‑seg and YOLOv8l‑seg shared the top box mAP@50 for occluded immature fruitlets at 0.847, while YOLO11m‑seg achieved the highest mask mAP@50 of 0.810. Despite the advancements in YOLO11, YOLOv8n surpassed its counterparts in image processing speed, with an impressive inference speed of 3.3 milliseconds, compared to the fastest YOLO11 series model at 4.8 milliseconds, underscoring its suitability for real‑time agricultural applications related to complex green fruit environments.

Abstract:
Unsupervised domain adaptive semantic segmentation (UDA‑SS) aims to train a model on the source domain data (e.g., synthetic) and adapt the model to predict target domain data (e.g., real‑world) without accessing target annotation data. Most existing UDA‑SS methods only focus on inter‑domain knowledge to mitigate the data‑shift problem. However, learning the inherent structure of the images and exploring the intrinsic pixel distribution of both domains are ignored, which prevents the UDA‑SS methods from producing satisfactory performance like supervised learning. Moreover, incorporating contextual knowledge is also often overlooked. Considering these issues, in this work, we propose a UDA‑SS framework that learns both intra‑domain and context‑aware knowledge. To learn the intra‑domain knowledge, we incorporate contrastive loss in both domains, which pulls pixels of similar classes together and pushes the rest away, facilitating intra‑image‑pixel‑wise correlations. To learn context‑aware knowledge, we modify the mixing technique by leveraging contextual dependency among the classes. Moreover, we adapt the Mask Image Modeling (MIM) technique to properly use context clues for robust visual recognition, using limited information about the masked images. Comprehensive experiments validate that our proposed method improves on state‑of‑the‑art UDA‑SS methods by a margin of 0.51% mIoU and 0.54% mIoU in the adaptation of GTA‑V → Cityscapes and Synthia → Cityscapes, respectively. We open‑source our C2DA code. Code link: github.com/Masrur02/C‑Squared‑DA

Abstract:
Navigating efficiently to an object in an unexplored environment is a critical skill for general‑purpose intelligent robots. Recent approaches to this object goal navigation problem have embraced a modular strategy, integrating classical exploration algorithms, notably frontier exploration, with a learned semantic mapping/exploration module. This paper introduces a novel informative path planning and 3D object probability mapping approach. The mapping module computes the probability of the object of interest through semantic segmentation and a Bayes filter. Additionally, it stores probabilities for common objects, which semantically guides the exploration based on common sense priors from a large language model. The planner terminates when the current viewpoint captures enough voxels identified with high confidence as the object of interest. Although our planner follows a zero‑shot approach, it achieves state‑of‑the‑art performance as measured by the Success weighted by Path Length (SPL) and Soft SPL in the Habitat ObjectNav Challenge 2023, outperforming other works by more than 20%. Furthermore, we validate its effectiveness on real robots. Project webpage: https://ippon‑paper.github.io/
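
As a minimal sketch of the per‑voxel Bayes filter mentioned above, assuming a binary "object present" state and a segmentation detector with hypothetical hit and false‑alarm rates (the abstract does not give the actual sensor model or its parameters), a log‑odds update for one voxel could look like this:

```python
import math


def logit(p):
    """Convert a probability to log-odds."""
    return math.log(p / (1.0 - p))


def bayes_update(prior, detected, p_hit=0.8, p_false=0.1):
    """Binary Bayes filter update for a single voxel.

    prior    -- current probability that the voxel belongs to the object
    detected -- True if semantic segmentation labelled the voxel as the object
    p_hit    -- assumed P(detected | object); hypothetical value
    p_false  -- assumed P(detected | not object); hypothetical value
    """
    if detected:
        likelihood_ratio = p_hit / p_false
    else:
        likelihood_ratio = (1.0 - p_hit) / (1.0 - p_false)
    log_odds = logit(prior) + math.log(likelihood_ratio)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Three consecutive observations of the same voxel, starting from an
# uninformative prior of 0.5.
p = 0.5
for observation in (True, True, False):
    p = bayes_update(p, observation)
```

A planner in this style could then count the voxels whose probability exceeds a chosen confidence threshold to decide when to stop; the threshold and the voxel count are design parameters that the abstract leaves unspecified.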

Abstract:
In vision‑based robot localization and SLAM, Visual Place Recognition (VPR) is essential. This paper addresses the problem of VPR, which involves accurately recognizing the location corresponding to a given query image. A popular approach to vision‑based place recognition relies on low‑level visual features. Despite significant progress in recent years, place recognition based on low‑level visual features is challenging when there are changes in scene appearance. To address this, end‑to‑end training approaches have been proposed to overcome the limitations of hand‑crafted features. However, these approaches still fail under drastic changes and require large amounts of labeled data to train models, presenting a significant limitation. Methods that leverage high‑level semantic information, such as objects or categories, have been proposed to handle variations in appearance. In this paper, we introduce a novel VPR approach that remains robust to scene changes and does not require additional training. Our method constructs semantic image descriptors by extracting pixel‑level embeddings using a zero‑shot, language‑driven semantic segmentation model. We validate our approach in challenging place recognition scenarios using a real‑world public dataset. The experiments demonstrate that our method outperforms non‑learned image representation techniques and off‑the‑shelf convolutional neural network (CNN) descriptors. Our code is available at https://github.com/woo‑soojin/context‑based‑vlpr.

Abstract:
We present Connected‑Component (CC)‑Metrics, a novel semantic segmentation evaluation protocol, targeted to align existing semantic segmentation metrics to a multi‑instance detection scenario in which each connected component matters. We motivate this setup in the common medical scenario of semantic metastases segmentation in a full‑body PET/CT. We show how existing semantic segmentation metrics suffer from a bias towards larger connected components, contradicting the clinical assessment of scans in which tumor size and clinical relevance are uncorrelated. To rebalance existing segmentation metrics, we propose to evaluate them on a per‑component basis, thus giving each tumor the same weight irrespective of its size. To match predictions to ground‑truth segments, we employ a proximity‑based matching criterion, evaluating common metrics locally at the component of interest. Using this approach, we break free of biases introduced by large metastases for overlap‑based metrics such as Dice or Surface Dice. CC‑Metrics also improves distance‑based metrics such as Hausdorff Distances, which are uninformative for small changes that do not influence the maximum or 95th percentile, and avoids pitfalls introduced by directly combining counting‑based metrics with overlap‑based metrics as is done in Panoptic Quality.
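
The size bias described above can be illustrated with a small sketch. It is a simplified 2D reading of the idea, not the authors' implementation: the nearest‑component assignment stands in for the "proximity‑based matching criterion", and all function names are hypothetical.

```python
from collections import deque


def foreground(mask):
    """Set of (row, col) positions that are foreground in a 2D 0/1 mask."""
    return {(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v}


def label_components(mask):
    """Label 4-connected foreground components; returns {label: [(r, c), ...]}."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    components = {}
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                label = len(components) + 1
                queue, pixels = deque([(r, c)]), []
                seen[r][c] = True
                while queue:
                    y, x = queue.popleft()
                    pixels.append((y, x))
                    for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                components[label] = pixels
    return components


def dice(gt_pixels, pred_pixels):
    """Dice overlap between two pixel sets (1.0 when both are empty)."""
    total = len(gt_pixels) + len(pred_pixels)
    return 2.0 * len(gt_pixels & pred_pixels) / total if total else 1.0


def global_dice(gt, pred):
    """Standard Dice over the whole image."""
    return dice(foreground(gt), foreground(pred))


def per_component_dice(gt, pred):
    """Mean Dice over ground-truth components, each scored in its own region."""
    components = label_components(gt)
    pred_pixels = foreground(pred)
    if not components:
        return 1.0 if not pred_pixels else 0.0
    # Assign each predicted pixel to the nearest ground-truth component.
    assigned = {label: set() for label in components}
    for r, c in pred_pixels:
        nearest = min(
            components,
            key=lambda k: min((r - y) ** 2 + (c - x) ** 2 for y, x in components[k]),
        )
        assigned[nearest].add((r, c))
    scores = [dice(set(components[k]), assigned[k]) for k in components]
    return sum(scores) / len(scores)


# One large lesion (9 pixels) is found; one small lesion (1 pixel) is missed.
gt = [[1, 1, 1, 0, 0],
      [1, 1, 1, 0, 1],
      [1, 1, 1, 0, 0]]
pred = [[1, 1, 1, 0, 0],
        [1, 1, 1, 0, 0],
        [1, 1, 1, 0, 0]]
```

In this toy case the global Dice is 18/19, while the per‑component mean is 0.5, because the missed small lesion counts as much as the large one.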

Abstract:
The availability of highly accurate urban airborne laser scanning (ALS) data will increase rapidly in the future, especially as acquisition costs decrease, for example through the use of drones. Current challenges in data processing are related to the limited spectral information and low point density of most ALS datasets. Another challenge will be the growing need for annotated training data, frequently produced by manual processes, to enable semantic interpretation of point clouds. This study proposes to semantically segment new high‑density (1200 points per square metre on average) multispectral ALS data with an unsupervised ground‑aware deep clustering method, GroupSP, inspired by the unsupervised GrowSP algorithm. GroupSP divides the scene into superpoints as a preprocessing step. The neural network is trained iteratively by grouping the superpoints and using the grouping assignments as pseudo‑labels. The predictions for the unseen data are given by over‑segmenting the test set and mapping the predicted classes into ground truth classes manually or with automated majority voting. GroupSP obtained an overall accuracy (oAcc) of 97% and a mean intersection over union (mIoU) of 80%. When compared to other unsupervised semantic segmentation methods, GroupSP outperformed GrowSP and non‑deep K‑means. However, a supervised random forest classifier outperformed GroupSP. The labelling efforts in GroupSP can be minimal; it was shown that GroupSP can semantically segment seven urban classes (building, high vegetation, low vegetation, asphalt, rock, football field, and gravel) with oAcc of 95% and mIoU of 75% using only 0.004% of the available annotated points in the mapping assignment. Finally, the multispectral information was examined; adding each new spectral channel improved the mIoU. Additionally, echo deviation was valuable, especially when distinguishing ground‑level classes.
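
The automated majority‑voting step that maps predicted clusters to ground‑truth classes can be sketched as follows. This is a minimal illustration under assumptions: the function name and the toy data are hypothetical, and `None` marks points without an annotation, reflecting the setting where only a small fraction of points is labelled.

```python
from collections import Counter, defaultdict


def majority_vote_mapping(cluster_ids, class_labels):
    """Map each cluster id to the most frequent class among its annotated points.

    cluster_ids  -- predicted cluster id per point
    class_labels -- ground-truth class per point, or None if unannotated
    """
    votes = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, class_labels):
        if label is not None:
            votes[cluster][label] += 1
    return {cluster: counts.most_common(1)[0][0] for cluster, counts in votes.items()}


# Toy example: eight points in three predicted clusters, two of them unannotated.
clusters = [0, 0, 0, 1, 1, 2, 2, 2]
labels = ["building", None, "building", "asphalt", None, "gravel", "asphalt", "gravel"]
mapping = majority_vote_mapping(clusters, labels)
predicted = [mapping.get(c) for c in clusters]
```

Clusters that contain no annotated point receive no entry in the mapping, so `mapping.get` returns `None` for them; how such clusters are handled in practice is not stated in the abstract.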

Abstract:
Robots operating in unstructured environments require a comprehensive understanding of their surroundings, necessitating geometric and semantic information from sensor data. Traditional RGB‑D processing pipelines focus primarily on geometric reconstruction, limiting their ability to support advanced robotic perception, planning, and interaction. A key challenge is the lack of generalized methods for segmenting RGB‑D data into semantically meaningful components while maintaining accurate geometric representations. We introduce a novel end‑to‑end modular pipeline that integrates state‑of‑the‑art semantic segmentation, human tracking, point‑cloud fusion, and scene reconstruction. Our approach improves semantic segmentation accuracy by leveraging the foundational segmentation model SAM2 with a hybrid method that combines its mask generation with a semantic classification model, resulting in sharper masks and high classification accuracy. Compared to SegFormer and OneFormer, our method achieves a similar semantic segmentation accuracy (mIoU of 47.0% vs 45.9% in the ADE20K dataset) but provides much more precise object boundaries. Additionally, our human tracking algorithm interacts with the segmentation enabling continuous tracking even when objects leave and re‑enter the frame by object re‑identification. Our point cloud fusion approach reduces computation time by 1.81x while maintaining a small mean reconstruction error of 25.3 mm by leveraging the semantic information. We validate our approach on benchmark datasets and real‑world Kinect RGB‑D data, demonstrating improved efficiency, accuracy, and usability. Our structured representation, stored in the Universal Scene Description (USD) format, supports efficient querying, visualization, and robotic simulation, making it practical for real‑world deployment.

Abstract:
The field of autonomous navigation for unmanned ground vehicles (UGVs) is in continuous growth and increasing levels of autonomy have been reached in the last few years. However, the task becomes more challenging when the focus is on the exploration of planet surfaces such as Mars. In those situations, UGVs are forced to navigate through unstable and rugged terrains which, inevitably, expose the vehicle to more hazards, accidents, and, in extreme cases, complete mission failure. The paper addresses the challenges of autonomous navigation for unmanned ground vehicles in planetary exploration, particularly on Mars, introducing a hybrid architecture for terrain traversability analysis that combines two approaches: appearance‑based and geometry‑based. The appearance‑based method uses semantic segmentation via deep neural networks to classify different terrain types. This is further refined by pixel‑level terrain roughness classification obtained from the same RGB image, assigning different costs based on the physical properties of the soil. The geometry‑based method complements the appearance‑based approach by evaluating the terrain's geometrical features, identifying hazards that may not be detectable by the appearance‑based side. The outputs of both methods are combined into a comprehensive hybrid cost map. The proposed architecture was trained on synthetic datasets and developed as a ROS2 application to integrate into broader autonomous navigation systems for harsh environments. Simulations have been performed in Unity, showing the ability of the method to perform online traversability analysis.

Abstract:
This study presents an architectural analysis of YOLOv11, the latest iteration in the YOLO (You Only Look Once) series of object detection models. We examine the model's architectural innovations, including the introduction of the C3k2 (Cross Stage Partial with kernel size 2) block, SPPF (Spatial Pyramid Pooling ‑ Fast), and C2PSA (Convolutional block with Parallel Spatial Attention) components, which contribute to improving the model's performance in several ways, such as enhanced feature extraction. The paper explores YOLOv11's expanded capabilities across various computer vision tasks, including object detection, instance segmentation, pose estimation, and oriented object detection (OBB). We review the model's performance improvements in terms of mean Average Precision (mAP) and computational efficiency compared to its predecessors, with a focus on the trade‑off between parameter count and accuracy. Additionally, the study discusses YOLOv11's versatility across different model sizes, from nano to extra‑large, catering to diverse application needs from edge devices to high‑performance computing environments. Our research provides insights into YOLOv11's position within the broader landscape of object detection and its potential impact on real‑time computer vision applications.

Abstract:
The acquisition of inductive bias through point‑level contrastive learning holds paramount significance in point cloud pre‑training. However, the quadratic growth in computational requirements with the scale of the point cloud poses a substantial impediment to practical deployment and execution. To address this challenge, this paper proposes an Effective Point‑level Contrastive Learning method for large‑scale point cloud understanding dubbed EPContrast, which consists of AGContrast and ChannelContrast. In practice, AGContrast constructs positive and negative pairs based on asymmetric granularity embedding, while ChannelContrast imposes contrastive supervision between channel feature maps. EPContrast offers point‑level contrastive loss while concurrently mitigating the computational resource burden. The efficacy of EPContrast is substantiated through comprehensive validation on S3DIS and ScanNetV2, encompassing tasks such as semantic segmentation, instance segmentation, and object detection. In addition, rich ablation experiments demonstrate remarkable bias induction capabilities under label‑efficient and one‑epoch training settings.

Abstract:
This paper presents a novel approach for multi‑kernel estimation by enhancing the KernelGAN algorithm, which traditionally estimates a single kernel for the entire image. We introduce Multi‑KernelGAN, which extends KernelGAN's capabilities by estimating two distinct kernels based on object segmentation masks. Our approach is validated through three distinct methods: texture‑based patch Fast Fourier Transform (FFT) calculation, detail‑based segmentation, and deep learning‑based object segmentation using YOLOv8 and the Segment Anything Model (SAM). Among these methods, the combination of YOLO and SAM yields the best results for kernel estimation. Experimental results demonstrate that our multi‑kernel estimation technique outperforms conventional single‑kernel methods in super‑resolution tasks.

Abstract:
This paper is motivated by an interesting phenomenon: the performance of object detection lags behind that of instance segmentation (i.e., performance imbalance) when investigating the intermediate results from the beginning transformer decoder layer of MaskDINO (i.e., the SOTA model for joint detection and segmentation). This phenomenon inspires us to think about a question: will the performance imbalance at the beginning layer of transformer decoder constrain the upper bound of the final performance? With this question in mind, we further conduct qualitative and quantitative pre‑experiments, which validate the negative impact of detection‑segmentation imbalance issue on the model performance. To address this issue, this paper proposes DI‑MaskDINO model, the core idea of which is to improve the final performance by alleviating the detection‑segmentation imbalance. DI‑MaskDINO is implemented by configuring our proposed De‑Imbalance (DI) module and Balance‑Aware Tokens Optimization (BATO) module to MaskDINO. DI is responsible for generating balance‑aware query, and BATO uses the balance‑aware query to guide the optimization of the initial feature tokens. The balance‑aware query and optimized feature tokens are respectively taken as the Query and Key&Value of transformer decoder to perform joint object detection and instance segmentation. DI‑MaskDINO outperforms existing joint object detection and instance segmentation models on COCO and BDD100K benchmarks, achieving +1.2 AP^box and +0.9 AP^mask improvements compared to SOTA joint detection and segmentation model MaskDINO. In addition, DI‑MaskDINO also obtains +1.0 AP^box improvement compared to SOTA object detection model DINO and +3.0 AP^mask improvement compared to SOTA segmentation model Mask2Former.

Abstract:
Nuclei instance segmentation is an essential task in pathology image analysis, serving as the foundation for many downstream applications. The release of several public datasets has significantly advanced research in this area, yet many existing methods struggle with data imbalance issues. To address this challenge, this study introduces a data augmentation method, called NucleiMix, which is designed to balance the distribution of nuclei types by increasing the number of rare‑type nuclei within datasets. NucleiMix operates in two phases. In the first phase, it identifies candidate locations similar to the surroundings of rare‑type nuclei and inserts rare‑type nuclei into the candidate locations. In the second phase, it employs a progressive inpainting strategy using a pre‑trained diffusion model to seamlessly integrate rare‑type nuclei into their new environments in replacement of major‑type nuclei or background locations. We systematically evaluate the effectiveness of NucleiMix on three public datasets using two popular nuclei instance segmentation models. The results demonstrate the superior ability of NucleiMix to synthesize realistic rare‑type nuclei and to enhance the quality of nuclei segmentation and classification in an accurate and robust manner.

Abstract:
Plane instance segmentation from RGB‑D data is a crucial research topic for many downstream tasks. However, most existing deep‑learning‑based methods utilize only information within the RGB bands, neglecting the important role of the depth band in plane instance segmentation. Based on EfficientSAM, a fast version of SAM, we propose a plane instance segmentation network called PlaneSAM, which can fully integrate the information of the RGB bands (spectral bands) and the D band (geometric band), thereby improving the effectiveness of plane instance segmentation in a multimodal manner. Specifically, we use a dual‑complexity backbone, with primarily the simpler branch learning D‑band features and primarily the more complex branch learning RGB‑band features. Consequently, the backbone can effectively learn D‑band feature representations even when D‑band training data is limited in scale, retain the powerful RGB‑band feature representations of EfficientSAM, and allow the original backbone branch to be fine‑tuned for the current task. To enhance the adaptability of our PlaneSAM to the RGB‑D domain, we pretrain our dual‑complexity backbone using the segment anything task on large‑scale RGB‑D data through a self‑supervised pretraining strategy based on imperfect pseudo‑labels. To support the segmentation of large planes, we optimize the loss function combination ratio of EfficientSAM. In addition, Faster R‑CNN is used as a plane detector, and its predicted bounding boxes are fed into our dual‑complexity network as prompts, thereby enabling fully automatic plane instance segmentation. Experimental results show that the proposed PlaneSAM sets a new SOTA performance on the ScanNet dataset, and outperforms previous SOTA approaches in zero‑shot transfer on the 2D‑3D‑S, Matterport3D, and ICL‑NUIM RGB‑D datasets, while only incurring a 10% increase in computational overhead compared to EfficientSAM.

Abstract:
Domain adaptive semantic segmentation is the task of generating precise and dense predictions for an unlabeled target domain using a model trained on a labeled source domain. While significant efforts have been devoted to improving unsupervised domain adaptation for this task, it is crucial to note that many models rely on a strong assumption that the source data is entirely and accurately labeled, while the target data is unlabeled. In real‑world scenarios, however, we often encounter partially or noisily labeled data in source and target domains, a setting referred to as Generalized Domain Adaptation (GDA). In such cases, we suggest leveraging weak or unlabeled data from both domains to narrow the gap between them, resulting in effective adaptation. We introduce the Generalized Gaussian‑mixture‑based (GenGMM) domain adaptation model, which harnesses the underlying data distribution in both domains to refine noisy weak and pseudo labels. The experiments demonstrate the effectiveness of our approach.

Abstract:
Small sample instance segmentation is a very challenging task, and many existing methods follow the training strategy of meta‑learning, which pre‑trains models on a support set and fine‑tunes them on a query set. The pre‑training phase, which is highly task related, requires a significant amount of additional training time and the selection of datasets with close proximity to ensure effectiveness. The article proposes a novel small sample instance segmentation solution from the perspective of maximizing the utilization of existing information without increasing annotation burden and training costs. The proposed method designs two modules to address the problems encountered in small sample instance segmentation. First, it helps the model fully utilize unlabeled data by learning to generate pseudo labels, increasing the number of available samples. Second, by integrating the features of text and image, more accurate classification results can be obtained. These two modules are suitable for box‑free and box‑dependent frameworks. In this way, the proposed method not only improves the performance of small sample instance segmentation, but also greatly reduces reliance on pre‑training. We have conducted experiments on three datasets from different scenes: on land, underwater and under microscope. As evidenced by our experiments, integrating image and text corrects the confidence of classification, and pseudo labels help the model obtain more precise masks. All the results demonstrate the effectiveness and superiority of our method.

Abstract:
Mixture‑of‑Experts (MoE) models embody the divide‑and‑conquer concept and are a promising approach for increasing model capacity, demonstrating excellent scalability across multiple domains. In this paper, we integrate the MoE structure into the classic Vision Transformer (ViT), naming it ViMoE, and explore the potential of applying MoE to vision through a comprehensive study on image classification and semantic segmentation. However, we observe that the performance is sensitive to the configuration of MoE layers, making it challenging to obtain optimal results without careful design. The underlying cause is that inappropriate MoE layers lead to unreliable routing and hinder experts from effectively acquiring helpful information. To address this, we introduce a shared expert to learn and capture common knowledge, serving as an effective way to construct stable ViMoE. Furthermore, we demonstrate how to analyze expert routing behavior, revealing which MoE layers are capable of specializing in handling specific information and which are not. This provides guidance for retaining the critical layers while removing redundancies, thereby advancing ViMoE to be more efficient without sacrificing accuracy. We aspire for this work to offer new insights into the design of vision MoE models and provide valuable empirical guidance for future research.
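
To make the shared‑expert idea concrete, here is a minimal sketch of an MoE layer whose output adds an always‑active shared expert to the gate‑weighted output of the top‑k routed experts. The additive combination, the linear experts, and all names are assumptions for illustration; the abstract does not specify how ViMoE combines the shared expert with the routed ones.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, x):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def moe_with_shared_expert(x, router, experts, shared, top_k=1):
    """Return (output, chosen expert indices) for one token vector x.

    output = shared(x) + sum over the top-k routed experts of gate_i * expert_i(x)
    Gates are softmax router scores and are not renormalized over the top-k
    (an assumption made here for simplicity).
    """
    gates = softmax(linear(router, x))
    chosen = sorted(range(len(experts)), key=lambda i: gates[i], reverse=True)[:top_k]
    out = linear(shared, x)  # the shared expert sees every token
    for i in chosen:
        routed = linear(experts[i], x)
        out = [o + gates[i] * r for o, r in zip(out, routed)]
    return out, chosen


# Toy example with two routed experts and one shared expert on a 2-d token.
x = [1.0, 0.0]
router = [[2.0, 0.0], [0.0, 2.0]]        # router logits for x are [2, 0]
experts = [[[1.0, 0.0], [0.0, 1.0]],     # expert 0: identity
           [[0.0, 1.0], [1.0, 0.0]]]     # expert 1: swaps the two coordinates
shared = [[0.5, 0.0], [0.0, 0.5]]        # shared expert: scales by 0.5
out, chosen = moe_with_shared_expert(x, router, experts, shared)
```

Recording `chosen` per token and per layer is also one simple way to inspect routing behavior, in the spirit of the routing analysis the abstract describes.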

Abstract:
An in‑depth exploration of object detection and semantic segmentation is provided, combining theoretical foundations with practical applications. State‑of‑the‑art advancements in machine learning and deep learning are reviewed, focusing on convolutional neural networks (CNNs), YOLO architectures, and transformer‑based approaches such as DETR. The integration of artificial intelligence (AI) techniques and large language models for enhancing object detection in complex environments is examined. Additionally, a comprehensive analysis of big data processing is presented, with emphasis on model optimization and performance evaluation metrics. By bridging the gap between traditional methods and modern deep learning frameworks, valuable insights are offered for researchers, data scientists, and engineers aiming to apply AI‑driven methodologies to large‑scale object detection tasks.

Abstract:
Renal tumors, especially renal cell carcinoma (RCC), show significant heterogeneity, posing challenges for diagnosis using radiology images such as MRI, echocardiograms, and CT scans. U‑Net based deep learning techniques are emerging as a promising approach for automated medical image segmentation for minimally invasive diagnosis of renal tumors. However, current techniques need further improvements in accuracy to become clinically useful to radiologists. In this study, we present an improved U‑Net based model for end‑to‑end automated semantic segmentation of CT scan images to identify renal tumors. The model uses residual connections across convolution layers, integrates a multi‑layer feature fusion (MFF) and cross‑channel attention (CCA) within encoder blocks, and incorporates skip connections augmented with additional information derived using MFF and CCA. We evaluated our model on the KiTS19 dataset, which contains data from 210 patients. For kidney segmentation, our model achieves a Dice Similarity Coefficient (DSC) of 0.97 and a Jaccard index (JI) of 0.95. For renal tumor segmentation, our model achieves a DSC of 0.96 and a JI of 0.91. Based on a comparison of available DSC scores, our model outperforms the current leading models.

Abstract:
Volumetric medical image segmentation is a fundamental problem in medical image analysis where the objective is to accurately classify a given 3D volumetric medical image with voxel‑level precision. In this work, we propose a novel hierarchical encoder‑decoder‑based framework that strives to explicitly capture the local and global dependencies for volumetric 3D medical image segmentation. The proposed framework exploits local volume‑based self‑attention to encode the local dependencies at high resolution and introduces a novel volumetric MLP‑mixer to capture the global dependencies at low‑resolution feature representations, respectively. The proposed volumetric MLP‑mixer learns better associations among volumetric feature representations. These explicit local and global feature representations contribute to better learning of the shape‑boundary characteristics of the organs. Extensive experiments on three different datasets reveal that the proposed method achieves favorable performance compared to state‑of‑the‑art approaches. On the challenging Synapse Multi‑organ dataset, the proposed method achieves an absolute 3.82% gain over the state‑of‑the‑art approaches in terms of HD95 evaluation metrics while a similar improvement pattern is exhibited in MSD Liver and Pancreas tumor datasets. We also provide a detailed comparison between recent architectural design choices in the 2D computer vision literature by adapting them for the problem of 3D medical image segmentation. Finally, our experiments on the ZebraFish 3D cell membrane dataset having limited training data demonstrate the superior transfer learning capabilities of the proposed vMixer model on the challenging 3D cell instance segmentation task, where accurate boundary prediction plays a vital role in distinguishing individual cell instances.

Abstract:
Recent research has investigated the shape and texture biases of pre‑trained deep neural networks (DNNs) in image classification. Those works test how much a trained DNN relies on specific image cues like texture. The present study shifts the focus to understanding the cue influence during training, analyzing what DNNs can learn from shape, texture, and color cues in absence of the others; investigating their individual and combined influence on the learning success. We analyze these cue influences at multiple levels by decomposing datasets into cue‑specific versions. Addressing semantic segmentation, we learn the given task from these reduced cue datasets, creating cue experts. Early fusion of cues is performed by constructing appropriate datasets. This is complemented by a late fusion of experts which allows us to study cue influence location‑dependent on pixel level. Experiments on Cityscapes, PASCAL Context, and a synthetic CARLA dataset show that while no single cue dominates, the shape + color expert predominantly improves the prediction of small objects and border pixels. The cue performance order is consistent for the tested convolutional and transformer architecture, indicating similar cue extraction capabilities, although pre‑trained transformers are said to be more biased towards shape than convolutional neural networks.

Abstract:
Road Extraction is a sub‑domain of Remote Sensing applications; it is a subject of extensive and ongoing research. The procedure of automatically extracting roads from satellite imagery encounters significant challenges due to the multi‑scale and diverse structures of roads; improvement in this field is needed. The DeepLab series, known for its proficiency in semantic segmentation due to its efficiency in interpreting multi‑scale objects' features, addresses some of these challenges caused by the varying nature of roads. The present work proposes the utilization of DeepLabV3+, the latest version of the DeepLab series, by introducing an innovative Dense Depthwise Dilated Separable Spatial Pyramid Pooling (DenseDDSSPP) module and integrating it in place of the conventional Atrous Spatial Pyramid Pooling (ASPP) module. This modification enhances the extraction of complex road structures from satellite images. This study hypothesizes that the integration of DenseDDSSPP, combined with an appropriately selected backbone network and a Squeeze‑and‑Excitation block, will generate an efficient dense feature map by focusing on relevant features, leading to more precise and accurate road extraction from Remote Sensing images. The results section presents a comparison of our model's performance against state‑of‑the‑art models, demonstrating better results that highlight the effectiveness and success of the proposed approach.

Abstract:
Segmentation and classification of large numbers of instances, such as cell nuclei, are crucial tasks in digital pathology for accurate diagnosis. However, the availability of high‑quality datasets for deep learning methods is often limited due to the complexity of the annotation process. In this work, we investigate the impact of noisy annotations on the training and performance of a state‑of‑the‑art CNN model for the combined task of detecting, segmenting and classifying nuclei in histopathology images. In this context, we investigate the conditions for determining an appropriate number of training epochs to prevent overfitting to annotation noise during training. Our results indicate that the utilisation of a small, correctly annotated validation set is instrumental in avoiding overfitting and maintaining model performance to a large extent. Additionally, our findings underscore the beneficial role of pre‑training.

Abstract:
Neural network performance scales with both model size and data volume, as shown in both language and image processing. This requires scaling‑friendly architectures and large datasets. While transformers have been adapted for 3D vision, a "GPT moment" remains elusive due to limited training data. We introduce ARKit LabelMaker, a large‑scale real‑world 3D dataset with dense semantic annotation that is more than three times larger than the largest prior dataset. Specifically, we extend ARKitScenes with automatically generated dense 3D labels using an extended LabelMaker pipeline, tailored for large‑scale pre‑training. Training on our dataset improves accuracy across architectures, achieving state‑of‑the‑art 3D semantic segmentation scores on ScanNet and ScanNet200, with notable gains on tail classes. Our code is available at https://labelmaker.org and our dataset at https://huggingface.co/datasets/labelmaker/arkit_labelmaker.

Abstract:
The diagnosis of diabetic retinopathy, which relies on fundus images, faces challenges in achieving transparency and interpretability when using a global classification approach. However, segmentation‑based databases are significantly more expensive to acquire, and combining them is often problematic. This paper introduces a novel method, termed adversarial style conversion, to address the lack of standardization in annotation styles across diverse databases. By training a single architecture on combined databases, the model spontaneously modifies its segmentation style depending on the input, demonstrating the ability to convert among different labeling styles. The proposed methodology adds a linear probe to detect dataset origin based on encoder features and employs adversarial attacks to condition the model's segmentation style. Results indicate significant qualitative and quantitative improvements through dataset combination, offering avenues for improved model generalization, uncertainty estimation and continuous interpolation between annotation styles. Our approach enables training a segmentation model with diverse databases while controlling and leveraging annotation styles for improved retinopathy diagnosis.

Abstract:
Data augmentation methods such as Copy‑Paste have been studied as effective ways to expand training datasets while incurring minimal costs. While such methods have been extensively implemented for image‑level tasks, we found no scalable implementation of Copy‑Paste built specifically for video tasks. In this paper, we leverage the recent growth in video fidelity of generative models to explore effective ways of incorporating synthetically generated objects into existing video datasets to artificially expand object instance pools. We first procure synthetic video sequences featuring objects that morph dynamically with time. Our carefully devised pipeline automatically segments and then copy‑pastes these dynamic instances across the frames of any target background video sequence. We name our video data augmentation pipeline Synthetic Dynamic Instance Copy‑Paste, and test it on the complex task of Video Instance Segmentation, which combines detection, segmentation and tracking of object instances across a video sequence. Extensive experiments on the popular YouTube‑VIS 2021 dataset using two separate popular networks as baselines achieve strong gains of +2.9 AP (6.5%) and +2.1 AP (4.9%). We make our code and models publicly available.
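
The core compositing step of a Copy‑Paste augmentation applied frame by frame can be sketched as follows. This is a minimal illustration using nested lists as grayscale frames, not the authors' pipeline: the function names are hypothetical, and it assumes the instance frames and binary masks are already aligned to the background resolution.

```python
def paste_instance(background, instance, mask):
    """Composite one frame: mask * instance + (1 - mask) * background, per pixel."""
    return [
        [m * i + (1 - m) * b for b, i, m in zip(b_row, i_row, m_row)]
        for b_row, i_row, m_row in zip(background, instance, mask)
    ]


def paste_video(bg_frames, inst_frames, mask_frames):
    """Apply the per-frame composite across a sequence so the pasted object can move or deform."""
    return [
        paste_instance(b, i, m)
        for b, i, m in zip(bg_frames, inst_frames, mask_frames)
    ]


# Two 2x3 frames: the pasted object (value 9) shifts one pixel to the right.
bg = [[[0, 0, 0], [0, 0, 0]], [[0, 0, 0], [0, 0, 0]]]
inst = [[[9, 9, 9], [9, 9, 9]], [[9, 9, 9], [9, 9, 9]]]
masks = [[[1, 0, 0], [0, 0, 0]], [[0, 1, 0], [0, 0, 0]]]
augmented = paste_video(bg, inst, masks)
```

The per-frame masks double as new instance annotations for the pasted object, which is what makes the augmentation usable for instance segmentation and tracking labels.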

Abstract:
Semi‑supervised learning (SSL) for medical image segmentation is a challenging yet highly practical task, which reduces reliance on large‑scale labeled datasets by leveraging unlabeled samples. Among SSL techniques, the weak‑to‑strong consistency framework, popularized by FixMatch, has emerged as a state‑of‑the‑art method in classification tasks. Notably, such a simple pipeline has also shown competitive performance in medical image segmentation. However, two key limitations still persist, impeding its efficient adaptation: (1) the neglect of contextual dependencies results in inconsistent predictions for similar semantic features, leading to incomplete object segmentation; (2) the lack of exploitation of semantic similarity between labeled and unlabeled data induces considerable class‑distribution discrepancy. To address these limitations, we propose a novel semi‑supervised framework based on FixMatch, named SemSim, powered by two appealing designs from a semantic similarity perspective: (1) rectifying pixel‑wise prediction by reasoning about the intra‑image pair‑wise affinity map, thus integrating contextual dependencies explicitly into the final prediction; (2) bridging labeled and unlabeled data via a feature querying mechanism for compact class representation learning, which fully considers cross‑image anatomical similarities. As reliable semantic similarity extraction depends on robust features, we further introduce an effective spatial‑aware fusion module (SFM) to explore distinctive information from multiple scales. Extensive experiments show that SemSim yields consistent improvements over the state‑of‑the‑art methods across three public segmentation benchmarks.
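
For readers unfamiliar with the weak‑to‑strong consistency idea that FixMatch popularized, a minimal per‑pixel sketch is given below. It only illustrates the generic baseline loss (confident pseudo‑labels from the weakly augmented view supervise the strongly augmented view); it does not implement SemSim's affinity rectification or feature querying. The threshold value and function names are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one pixel's class logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weak_to_strong_loss(weak_logits, strong_logits, threshold=0.95):
    """Cross-entropy on the strong view against confident pseudo-labels from the weak view.

    Both arguments are lists of per-pixel logit lists. Pixels whose weak-view
    confidence falls below the threshold contribute zero loss, but the sum is
    still averaged over all pixels.
    """
    total, kept = 0.0, 0
    for weak, strong in zip(weak_logits, strong_logits):
        probs = softmax(weak)
        confidence = max(probs)
        if confidence < threshold:
            continue  # unreliable pseudo-label: masked out
        pseudo_label = probs.index(confidence)
        total += -math.log(softmax(strong)[pseudo_label])
        kept += 1
    count = len(weak_logits)
    return (total / count if count else 0.0), kept


# Pixel 0 is confident in the weak view; pixel 1 is not and gets masked.
loss, kept = weak_to_strong_loss([[10.0, 0.0], [0.1, 0.0]], [[0.0, 0.0], [5.0, 0.0]])
```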

Abstract:
Distribution shifts widely exist in medical images acquired from different medical centres, hindering the deployment of semantic segmentation models trained on one centre (source domain) to another (target domain). While unsupervised domain adaptation has shown significant promise in mitigating these shifts, it poses privacy risks due to sharing data between centres. To facilitate adaptation while preserving data privacy, source‑free domain adaptation (SFDA) and test‑time adaptation (TTA) have emerged as effective paradigms, relying solely on target domain data. However, SFDA requires a pre‑collected target domain dataset before deployment, while TTA insufficiently exploits the potential value of test data, as it processes the test data only once. Considering that, in clinical practice, most medical centres operate during the day and remain inactive at night, we propose a novel adaptation framework called Day‑Night Adaptation (DyNA), which performs adaptation through day‑night cycles without requiring access to source data. During the day, a low‑frequency prompt is trained to adapt the frozen model to each test sample. We construct a memory bank for prompt initialization and develop a warm‑up mechanism to enhance prompt training. During the night, we reuse test data collected from the day and introduce a global student model to bridge the knowledge between teacher and student models, facilitating model fine‑tuning while ensuring training stability. Extensive experiments demonstrate that our DyNA outperforms existing TTA and SFDA methods on two benchmark medical image segmentation tasks. Code will be available after the paper is published.

Abstract:
Referring multi‑object tracking (RMOT) is an emerging cross‑modal task that aims to locate an arbitrary number of target objects and maintain their identities referred by a language expression in a video. This intricate task involves the reasoning of linguistic and visual modalities, along with the temporal association of target objects. However, the seminal work employs only loose feature fusion and overlooks the utilization of long‑term information on tracked objects. In this study, we introduce a compact Transformer‑based method, termed TenRMOT. We conduct feature fusion at both encoding and decoding stages to fully exploit the advantages of Transformer architecture. Specifically, we incrementally perform cross‑modal fusion layer‑by‑layer during the encoding phase. In the decoding phase, we utilize language‑guided queries to probe memory features for accurate prediction of the desired objects. Moreover, we introduce a query update module that explicitly leverages temporal prior information of the tracked objects to enhance the consistency of their trajectories. In addition, we introduce a novel task called Referring Multi‑Object Tracking and Segmentation (RMOTS) and construct a new dataset named Ref‑KITTI Segmentation. Our dataset consists of 18 videos with 818 expressions, and each expression averages 10.7 masks, which poses a greater challenge compared to the typical single mask in most existing referring video segmentation datasets. TenRMOT demonstrates superior performance on both the referring multi‑object tracking and the segmentation tasks.

Abstract:
Automated vehicles rely on an accurate and robust perception of the environment. Similarly to automated cars, highly automated trains require environmental perception. Although there is a lot of research based on either camera or LiDAR sensors in the automotive domain, very few contributions for this task exist yet for automated trains. Additionally, no public dataset or described approach for 3D LiDAR semantic segmentation in the railway environment exists yet. Thus, we propose an approach for point‑wise 3D semantic segmentation based on the 2DPass network architecture using scans and images jointly. In addition, we present a semi‑automated intelligent data annotation approach, which we use to efficiently and accurately label the required dataset recorded on a railway track in Germany. To improve performance despite a still small number of labeled scans, we apply an active learning approach to intelligently select scans for the training dataset. Our contributions are threefold: we annotate rail data including camera and LiDAR data from the railway environment, transfer labels to the raw LiDAR point clouds using an image segmentation network, and train a state‑of‑the‑art 3D LiDAR semantic segmentation network efficiently by leveraging active learning. The trained network achieves good segmentation results with a mean IoU of 71.48% over 9 classes.
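
The mean IoU reported above is the per‑class intersection over union averaged across classes. A minimal sketch for flat lists of per‑point labels is shown below; skipping classes that appear in neither prediction nor ground truth is one common convention and is an assumption here, since evaluation protocols differ on this point.

```python
def mean_iou(pred, target, num_classes):
    """Average of per-class intersection / union over classes present in pred or target."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if p == t:
            inter[t] += 1
            union[t] += 1
        else:
            # A mismatch enlarges the union of both the predicted and the true class.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


# Per-class IoU: class 0 = 1/3, class 1 = 2/3, class 2 = 1/2.
score = mean_iou([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0], num_classes=3)
```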

Abstract:
Recent advancements in artificial intelligence (AI) have precipitated a paradigm shift in medical imaging, particularly revolutionizing the domain of brain imaging. This paper systematically investigates the integration of deep learning, a principal branch of AI, into the semantic segmentation of brain images. Semantic segmentation serves as an indispensable technique for the delineation of discrete anatomical structures and the identification of pathological markers, essential for the diagnosis of complex neurological disorders. Historically, the reliance on manual interpretation by radiologists, while noteworthy for its accuracy, is plagued by inherent subjectivity and inter‑observer variability. This limitation becomes more pronounced with the exponential increase in imaging data, which traditional methods struggle to process efficiently and effectively. In response to these challenges, this study introduces the application of adversarial neural networks, a novel AI approach that not only automates but also refines the semantic segmentation process. By leveraging these advanced neural networks, our approach enhances the precision of diagnostic outputs, reducing human error and increasing the throughput of imaging data analysis. The paper provides a detailed discussion on how adversarial neural networks facilitate a more robust, objective, and scalable solution, thereby significantly improving diagnostic accuracies in neurological evaluations. This exploration highlights the transformative impact of AI on medical imaging, setting a new benchmark for future research and clinical practice in neurology.

Abstract:
Incremental Few‑Shot Semantic Segmentation (iFSS) tackles a task that requires a model to continually expand its segmentation capability on novel classes using only a few annotated examples. Typical incremental approaches encounter a challenge that the objective of the base training phase (fitting base classes with sufficient instances) does not align with the incremental learning phase (rapidly adapting to new classes with less forgetting). This disconnect can result in suboptimal performance in the incremental setting. This study introduces a meta‑learning‑based prototype approach that encourages the model to learn how to adapt quickly while preserving previous knowledge. Concretely, we mimic the incremental evaluation protocol during the base training session by sampling a sequence of pseudo‑incremental tasks. Each task in the simulated sequence is trained using a meta‑objective to enable rapid adaptation without forgetting. To enhance discrimination among class prototypes, we introduce prototype space redistribution learning, which dynamically updates class prototypes to establish optimal inter‑prototype boundaries within the prototype space. Extensive experiments on iFSS datasets built upon PASCAL and COCO benchmarks show the advanced performance of the proposed approach, offering valuable insights for addressing iFSS challenges.

Abstract:
This paper presents a method for generating large‑scale datasets to improve class‑agnostic video segmentation across robots with different form factors. Specifically, we consider the question of whether video segmentation models trained on generic segmentation data could be more effective for particular robot platforms if robot embodiment is factored into the data generation process. To answer this question, a pipeline is formulated for using 3D reconstructions (e.g. from HM3DSem) to generate segmented videos that are configurable based on a robot's embodiment (e.g. sensor type, sensor placement, and illumination source). A resulting massive RGB‑D video panoptic segmentation dataset (MVPd) is introduced for extensive benchmarking with foundation and video segmentation models, as well as to support embodiment‑focused research in video segmentation. Our experimental findings demonstrate that using MVPd for finetuning can lead to performance improvements when transferring foundation models to certain robot embodiments, such as specific camera placements. These experiments also show that using 3D modalities (depth images and camera pose) can lead to improvements in video segmentation accuracy and consistency. The project webpage is available at https://topipari.com/projects/MVPd

Abstract:
In this paper, we address the vision‑based autonomous landing problem in complex urban environments using deep neural networks for semantic segmentation and risk assessment. We propose employing SegFormer, a state‑of‑the‑art visual transformer network, for the semantic segmentation of complex, unstructured urban environments. This approach yields valuable information that can be utilized in smart autonomous landing missions, particularly in emergency landing scenarios resulting from system failures or human errors. The assessment is performed in real time during flight: images from an RGB camera on board the Unmanned Aerial Vehicle (UAV) are segmented with SegFormer into the most common classes found in urban environments. These classes are then mapped to a level of risk that considers potential material damage, damage to the drone itself, and danger to people. The proposed strategy is validated through several case studies, demonstrating the huge potential of semantic segmentation‑based strategies for determining the safest landing areas for autonomous emergency landing, which we believe will help unleash the full potential of UAVs in civil applications within urban areas.

Abstract:
Osteoarthritis is a degenerative condition affecting bones and cartilage, often leading to osteophyte formation, bone density loss, and joint space narrowing. Treatment options to restore normal joint function vary depending on the severity of the condition. This work introduces an innovative deep‑learning framework processing shoulder CT scans. It features the semantic segmentation of the proximal humerus and scapula, the 3D reconstruction of bone surfaces, the identification of the glenohumeral (GH) joint region, and the staging of three common osteoarthritic‑related pathologies: osteophyte formation (OS), GH space reduction (JS), and humeroscapular alignment (HSA). The pipeline comprises two cascaded CNN architectures: 3D CEL‑UNet for segmentation and 3D Arthro‑Net for threefold classification. A retrospective dataset of 571 CT scans featuring patients with various degrees of GH osteoarthritic‑related pathologies was used to train, validate, and test the pipeline. Root mean squared error and Hausdorff distance median values for 3D reconstruction were 0.22mm and 1.48mm for the humerus and 0.24mm and 1.48mm for the scapula, outperforming state‑of‑the‑art architectures and making it potentially suitable for a PSI‑based shoulder arthroplasty preoperative plan context. The classification accuracy for OS, JS, and HSA consistently reached around 90% across all three categories. The computational time for the inference pipeline was less than 15s, showcasing the framework's efficiency and compatibility with orthopedic radiology practice. The outcomes represent a promising advancement toward the medical translation of artificial intelligence tools. This progress aims to streamline the preoperative planning pipeline delivering high‑quality bone surfaces and supporting surgeons in selecting the most suitable surgical approach according to the unique patient joint conditions.
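
The two surface‑accuracy metrics named above can be illustrated on small point sets. The sketch below uses a brute‑force nearest‑neighbour search and treats surfaces as point clouds; how the authors sample surfaces and aggregate distances is not stated in the abstract, so this is only a generic definition of RMSE over nearest‑point distances and of the symmetric Hausdorff distance.

```python
import math


def nearest_distances(src, dst):
    """Distance from every point in src to its closest point in dst (brute force)."""
    return [min(math.dist(p, q) for q in dst) for p in src]


def rmse(src, dst):
    """Root mean squared nearest-point distance from src to dst (one-directional)."""
    dists = nearest_distances(src, dst)
    return math.sqrt(sum(d * d for d in dists) / len(dists))


def hausdorff(a, b):
    """Symmetric Hausdorff distance: the worst nearest-point distance in either direction."""
    return max(max(nearest_distances(a, b)), max(nearest_distances(b, a)))


reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reconstruction = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 2.0, 0.0)]
h = hausdorff(reference, reconstruction)
e = rmse(reconstruction, reference)
```

The example shows why both metrics are reported together: a single outlying point dominates the Hausdorff distance, while the RMSE averages it with the well‑fitted points.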

Abstract:
Foundation models have significantly enhanced 2D task performance, and recent works like Bridge3D have successfully applied these models to improve 3D scene understanding through knowledge distillation, marking considerable advancements. Nonetheless, challenges such as the misalignment between 2D and 3D representations and the persistent long‑tail distribution in 3D datasets still restrict the effectiveness of knowledge distillation from 2D to 3D using foundation models. To tackle these issues, we introduce a novel SAM‑guided tokenization method that seamlessly aligns 3D transformer structures with region‑level knowledge distillation, replacing the traditional KNN‑based tokenization techniques. Additionally, we implement a group‑balanced re‑weighting strategy to effectively address the long‑tail problem in knowledge distillation. Furthermore, inspired by the recent success of masked feature prediction, our framework incorporates a two‑stage masked token prediction process in which the student model predicts both the global embeddings and the token‑wise local embeddings derived from the teacher models trained in the first stage. Our methodology has been validated across multiple datasets, including SUN RGB‑D, ScanNet, and S3DIS, for tasks like 3D object detection and semantic segmentation. The results demonstrate significant improvements over current state‑of‑the‑art self‑supervised methods, establishing new benchmarks in this field.

Abstract:
In this work, we propose a novel approach, namely WeatherDG, that can generate realistic, weather‑diverse driving‑scene images based on the cooperation of two foundation models, i.e., Stable Diffusion (SD) and a Large Language Model (LLM). Specifically, we first fine‑tune the SD with source data, aligning the content and layout of generated samples with real‑world driving scenarios. Then, we propose a procedural prompt generation method based on the LLM, which can enrich scenario descriptions and help SD automatically generate more diverse, detailed images. In addition, we introduce a balanced generation strategy, which encourages the SD to generate high‑quality objects of tail classes, such as riders and motorcycles, under various weather conditions. This segmentation‑model‑agnostic method can improve the generalization ability of existing models by additionally adapting them with the generated synthetic data. Experiments on three challenging datasets show that our method can significantly improve the segmentation performance of different state‑of‑the‑art models on target domains. Notably, in the "Cityscapes to ACDC" setting, our method improves the baseline HRDA by 13.9% in mIoU.

Abstract:
Applying deep learning methods to bark removal equipment for wood panels in order to enhance the quality and efficiency of bark removal is a significant and challenging endeavor. This study develops and tests deep learning‑based bark removal equipment for wood panels. In accordance with the practical requirements of sawmills, bark removal equipment with a vision inspection system is designed. Based on a substantial collection of wood panel images obtained using the vision inspection system, the first general wood panel semantic segmentation dataset is constructed for training the BiSeNetV1 model employed in this study. Furthermore, the calculation methods and processes for the key data required in the bark removal process are presented in detail. Comparative experiments of the BiSeNetV1 model and tests of bark removal effectiveness are conducted in both laboratory and sawmill environments. The results of the comparative experiments indicate that the application of the BiSeNetV1 segmentation model is rational and feasible. The results of the bark removal effectiveness tests demonstrate a significant improvement in both the quality and efficiency of bark removal. The developed equipment fully meets the sawmill's requirements for precision and efficiency in bark removal processing.

Abstract:
The emergence of Segment Anything (SAM) sparked research interest in the field of interactive segmentation, especially in the context of image editing tasks and speeding up data annotation. Unlike common semantic segmentation, interactive segmentation methods allow users to directly influence their output through prompts (e.g. clicks). However, click patterns in real‑world interactive segmentation scenarios remain largely unexplored. Most methods rely on the assumption that users would click in the center of the largest erroneous area. Nevertheless, recent studies show that this is not always the case. Thus, methods may have poor performance in real‑world deployment despite high metrics in a baseline benchmark. To accurately simulate real‑user clicks, we conducted a large crowdsourcing study of click patterns in an interactive segmentation scenario and collected 475K real‑user clicks. Drawing on ideas from saliency tasks, we develop a clickability model that enables sampling clicks that closely resemble actual user inputs. Using our model and dataset, we propose the RClicks benchmark for a comprehensive comparison of existing interactive segmentation methods on realistic clicks. Specifically, we evaluate not only the average quality of methods, but also their robustness w.r.t. click patterns. According to our benchmark, in real‑world usage interactive segmentation models may perform worse than reported in the baseline benchmark, and most of the methods are not robust. We believe that RClicks is a significant step towards creating interactive segmentation methods that provide the best user experience in real‑world cases.

Abstract:
Visual‑textual correlations in the attention maps derived from text‑to‑image diffusion models are proven beneficial to dense visual prediction tasks, e.g., semantic segmentation. However, a significant challenge arises due to the input distributional discrepancy between the context‑rich sentences used for image generation and the isolated class names typically used in semantic segmentation. This discrepancy hinders diffusion models from capturing accurate visual‑textual correlations. To solve this, we propose InvSeg, a test‑time prompt inversion method that tackles open‑vocabulary semantic segmentation by inverting image‑specific visual context into text prompt embedding space, leveraging structure information derived from the diffusion model's reconstruction process to enrich text prompts so as to associate each class with a structure‑consistent mask. Specifically, we introduce Contrastive Soft Clustering (CSC) to align derived masks with the image's structure information, softly selecting anchors for each class and calculating weighted distances to push inner‑class pixels closer while separating inter‑class pixels, thereby ensuring mask distinction and internal consistency. By incorporating sample‑specific context, InvSeg learns context‑rich text prompts in embedding space and achieves accurate semantic alignment across modalities. Experiments show that InvSeg achieves state‑of‑the‑art performance on the PASCAL VOC, PASCAL Context and COCO Object datasets.

Abstract:
Vision language models (VLMs) have seen growing adoption in recent years, but many still make basic spatial reasoning errors. We hypothesize that this is due to VLMs adopting pre‑trained vision backbones, specifically vision transformers (ViTs) trained with image‑level supervision and minimal inductive biases. Such models may fail to encode the class contents at each position in the image, and our goal is to resolve this with a vision backbone that effectively captures both local and global image semantics. Our main insight is that we do not require new supervision to learn this capability: pre‑trained models contain significant knowledge of local semantics that we can extract and use for scalable self‑supervision. We propose a new efficient post‑training stage for ViTs called locality alignment and a novel fine‑tuning procedure called MaskEmbed that uses a masked reconstruction loss to learn semantic contributions for each image patch. We first evaluate locality alignment with a vision‑only benchmark, finding that it improves a model's performance at patch‑level semantic segmentation, especially for strong backbones trained with image‑caption pairs (e.g., CLIP and SigLIP). We then train a series of VLMs with and without locality alignment, and show that locality‑aligned backbones improve performance across a range of benchmarks, particularly ones that involve spatial understanding (e.g., RefCOCO, OCID‑Ref, TallyQA, VSR, AI2D). Overall, we demonstrate that we can efficiently learn local semantic extraction via a locality alignment stage, and that this procedure benefits VLM training recipes that use off‑the‑shelf vision backbones.

Abstract:
In Autonomous Driving (AD) Perception, cyclists are considered safety‑critical scene objects. Commonly used publicly‑available AD datasets typically contain large amounts of car and vehicle object instances but a low number of cyclist instances, usually with limited appearance and pose diversity. This cyclist training data scarcity problem not only limits the generalization of deep‑learning perception models for cyclist semantic segmentation, pose estimation, and cyclist crossing intention prediction, but also limits research on new cyclist‑related tasks such as fine‑grained cyclist pose estimation and spatio‑temporal analysis under complex interactions between humans and articulated objects. To address this data scarcity problem, in this paper we propose a framework to generate synthetic dynamic 3D cyclist data assets that can be used to generate training data for different tasks. In our framework, we designed a methodology for creating a new part‑based multi‑view articulated synthetic 3D bicycle dataset, called 3DArticBikes, which we use to train a 3D Gaussian Splatting (3DGS)‑based reconstruction and image rendering method. We then propose a parametric bicycle 3DGS composition model to assemble 8‑DoF pose‑controllable 3D bicycles. Finally, using dynamic information from cyclist videos, we build a complete synthetic dynamic 3D cyclist (rider pedaling a bicycle) by re‑posing a selectable synthetic 3D person, while automatically placing the rider onto one of our new articulated 3D bicycles using a proposed 3D keypoint optimization‑based Inverse Kinematics pose refinement. We present both qualitative and quantitative results in which we compare our generated cyclists against those from a recent stable diffusion‑based method.

Abstract:
Semantic segmentation of remote sensing images is a fundamental task in geospatial research. However, widely used Convolutional Neural Networks (CNNs) and Transformers have notable drawbacks: CNNs may be limited by insufficient remote sensing modeling capability, while Transformers face challenges due to computational complexity. In this paper, we propose a remote‑sensing image semantic segmentation network named LKASeg, which combines Large Kernel Attention (LSKA) and Full‑Scale Skip Connections (FSC). Specifically, we propose a decoder based on Large Kernel Attention (LKA), which extracts global features while avoiding the computational overhead of self‑attention and providing channel adaptability. To achieve full‑scale feature learning and fusion, we apply Full‑Scale Skip Connections (FSC) between the encoder and decoder. We conducted experiments by combining the LKA‑based decoder with FSC. On the ISPRS Vaihingen dataset, the mF1 and mIoU scores reached 90.33% and 82.77%, respectively.

Abstract:
Mamba has garnered widespread attention due to its flexible design and efficient hardware performance to process 1D sequences based on the state space model (SSM). Recent studies have attempted to apply Mamba to the visual domain by flattening 2D images into patches and then regarding them as a 1D sequence. To compensate for the 2D structure information loss (e.g., local similarity) of the original image, most existing methods focus on designing different orders to sequentially process the tokens, which could only alleviate this issue to some extent. In this paper, we propose a Visual 2‑Dimensional Mamba (V2M) model as a complete solution, which directly processes image tokens in the 2D space. We first generalize SSM to the 2‑dimensional space which generates the next state considering two adjacent states on both dimensions (e.g., columns and rows). We then construct our V2M based on the 2‑dimensional SSM formulation and incorporate Mamba to achieve hardware‑efficient parallel processing. The proposed V2M effectively incorporates the 2D locality prior yet inherits the efficiency and input‑dependent scalability of Mamba. Extensive experimental results on ImageNet classification and downstream visual tasks including object detection and instance segmentation on COCO and semantic segmentation on ADE20K demonstrate the effectiveness of our V2M compared with other visual backbones.
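
To make the idea of a state that depends on neighbours along both image dimensions concrete, here is a deliberately simplified scalar recurrence. It is not the V2M formulation (which the abstract does not spell out): the coefficients `a_row`, `a_col`, `b`, and `c` are hypothetical fixed scalars, whereas Mamba‑style models use input‑dependent parameters and a hardware‑efficient parallel scan.

```python
def ssm_2d_scan(x, a_row, a_col, b, c):
    """Toy 2D state recurrence over an H x W grid of scalars.

    h[i][j] = a_row * h[i-1][j] + a_col * h[i][j-1] + b * x[i][j]
    y[i][j] = c * h[i][j]
    States outside the grid are treated as zero.
    """
    height, width = len(x), len(x[0])
    h = [[0.0] * width for _ in range(height)]
    y = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            above = h[i - 1][j] if i > 0 else 0.0
            left = h[i][j - 1] if j > 0 else 0.0
            h[i][j] = a_row * above + a_col * left + b * x[i][j]
            y[i][j] = c * h[i][j]
    return y


# Impulse at the top-left corner: its influence spreads along both rows and columns.
response = ssm_2d_scan([[1.0, 0.0], [0.0, 0.0]], a_row=0.5, a_col=0.25, b=1.0, c=1.0)
```

Unlike a 1D scan over flattened patches, the bottom‑right state here receives the impulse through both its upper and its left neighbour, which is the locality prior that flattening loses.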

Abstract:
Vision mambas have demonstrated strong performance with complexity linear in the number of vision tokens. Their efficiency results from processing image tokens sequentially. However, most existing methods employ patch‑based image tokenization and then flatten the patches into 1D sequences for causal processing, which ignores the intrinsic 2D structural correlations of images. It is also difficult to extract global information by sequential processing of local patches. In this paper, we propose a global image serialization method to transform the image into a sequence of causal tokens, which contain global information of the 2D image. We first convert the image from the spatial domain to the frequency domain using the Discrete Cosine Transform (DCT) and then arrange the pixels with corresponding frequency ranges. We further transform each set within the same frequency band back to the spatial domain to obtain a series of images before tokenization. We construct a vision mamba model, GlobalMamba, with a causal input format based on the proposed global image serialization, which can better exploit the causal relations among image sequences. Extensive experiments demonstrate the effectiveness of our GlobalMamba, including image classification on ImageNet‑1K, object detection on COCO, and semantic segmentation on ADE20K.
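
The frequency‑band serialization described above can be sketched on a small square image: take a 2D DCT, keep only the coefficients belonging to one band, and invert each band separately to obtain a sequence of images ordered from low to high frequency. The band assignment by the index sum `u + v` and the number of bands are assumptions made for illustration; the abstract does not specify how GlobalMamba partitions frequencies.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II matrix; row k holds the k-th cosine basis vector."""
    return [
        [math.sqrt((1.0 if k == 0 else 2.0) / n)
         * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
         for i in range(n)]
        for k in range(n)
    ]


def transpose(a):
    return [list(row) for row in zip(*a)]


def matmul(a, b):
    cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def dct2(img):
    """2D DCT of a square image: C X C^T."""
    c = dct_matrix(len(img))
    return matmul(matmul(c, img), transpose(c))


def idct2(coef):
    """Inverse 2D DCT: C^T Y C (valid because C is orthonormal)."""
    c = dct_matrix(len(coef))
    return matmul(matmul(transpose(c), coef), c)


def frequency_band_images(img, num_bands):
    """Split DCT coefficients into bands by u + v and invert each band on its own."""
    n = len(img)
    coef = dct2(img)
    span = 2 * (n - 1) + 1  # number of distinct u + v values
    images = []
    for band in range(num_bands):
        kept = [[coef[u][v] if (u + v) * num_bands // span == band else 0.0
                 for v in range(n)] for u in range(n)]
        images.append(idct2(kept))
    return images


image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
bands = frequency_band_images(image, num_bands=3)
```

Because every coefficient lands in exactly one band and the transform is linear, the band images sum back to the original image. Each band image is full resolution, so every token derived from it carries information about the whole image.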

Abstract:
Kolmogorov‑Arnold Networks (KANs) have recently gained attention as an alternative to traditional Multilayer Perceptrons (MLPs) in deep learning frameworks. KANs have been integrated into various deep learning architectures such as convolutional neural networks, graph neural networks, and transformers, with their performance evaluated. However, their effectiveness within point‑cloud‑based neural networks remains unexplored. To address this gap, we incorporate KANs into PointNet for the first time to evaluate their performance on 3D point cloud classification and segmentation tasks. Specifically, we introduce PointNet‑KAN, built upon two key components. First, it employs KANs instead of traditional MLPs. Second, it retains the core principle of PointNet by using shared KAN layers and applying symmetric functions for global feature extraction, ensuring permutation invariance with respect to the input features. In traditional MLPs, the goal is to train the weights and biases with fixed activation functions; however, in KANs, the goal is to train the activation functions themselves. We use Jacobi polynomials to construct the KAN layers. We extensively and systematically evaluate PointNet‑KAN across various polynomial degrees and special types such as the Lagrange, Chebyshev, and Gegenbauer polynomials. Our results show that PointNet‑KAN achieves competitive performance compared to PointNet with MLPs on benchmark datasets for 3D object classification and part and semantic segmentation, despite employing a shallower and simpler network architecture. We also study a hybrid PointNet model incorporating both KAN and MLP layers. We hope this work serves as a foundation and provides guidance for integrating KANs, as an alternative to MLPs, into more advanced point cloud processing architectures.
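
The two components named above, a shared KAN layer applied to every point and a symmetric function for global features, can be sketched as follows. For brevity this sketch uses the Chebyshev recurrence (one of the special cases the abstract mentions) rather than general Jacobi polynomials, and the `tanh` input squashing, the weight layout `weights[o][i][d]`, and the function names are assumptions, not the paper's implementation.

```python
import math

def chebyshev_basis(t, degree):
    """Return [T_0(t), ..., T_degree(t)] via T_{n+1} = 2 t T_n - T_{n-1}."""
    vals = [1.0, t]
    for _ in range(2, degree + 1):
        vals.append(2.0 * t * vals[-1] - vals[-2])
    return vals[:degree + 1]

def kan_layer(point, weights):
    """Map one point to output features with learnable polynomial activations.

    weights[o][i][d] is the coefficient of T_d for input i and output o.
    """
    degree = len(weights[0][0]) - 1
    # tanh keeps inputs inside [-1, 1], where the polynomials are well behaved.
    basis = [chebyshev_basis(math.tanh(x), degree) for x in point]
    return [sum(w * basis[i][d]
                for i, w_in in enumerate(w_out)
                for d, w in enumerate(w_in))
            for w_out in weights]

def global_feature(points, weights):
    """Apply the same layer to every point, then max-pool over points."""
    per_point = [kan_layer(p, weights) for p in points]
    return [max(column) for column in zip(*per_point)]
```

Since every point passes through the same weights and `max` ignores ordering, shuffling the input points leaves the global feature unchanged.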

Abstract:
Segment Anything Model (SAM) has gained significant recognition in the field of semantic segmentation due to its versatile capabilities and impressive performance. Despite its success, SAM faces two primary limitations: (1) it relies heavily on meticulous human‑provided prompts like key points, bounding boxes or text messages, which is labor‑intensive; (2) the mask decoder's feature representation is sometimes inaccurate, as it solely employs dot product operations at the end of the mask decoder, which inadequately captures the necessary correlations for precise segmentation. Current solutions to these problems, such as fine‑tuning SAM, often require retraining a large number of parameters, which demands a huge amount of time and computing resources. To address these limitations, we propose an automated prompting and mask calibration method called AM‑SAM based on a bi‑level optimization framework. Our approach automatically generates prompts for an input image, eliminating the need for human involvement while performing well in early training epochs and converging faster. Additionally, we freeze the main part of SAM, and modify the mask decoder with Low‑Rank Adaptation (LoRA), enhancing the mask decoder's feature representation by incorporating advanced techniques that go beyond simple dot product operations to more accurately capture and utilize feature correlations. Our experimental results demonstrate that AM‑SAM achieves highly accurate segmentation, matching or exceeding the effectiveness of human‑generated and default prompts. Notably, on the body segmentation dataset, our method yields a 5% higher Dice score with a 4‑example few‑shot training set compared to the SOTA method, underscoring its superiority in semantic segmentation tasks.

Abstract:
Ensuring thermal comfort is essential for the well‑being and productivity of individuals in built environments. Of the various thermal comfort indicators, the mean radiant temperature (MRT) is very challenging to measure. Most common measurement methodologies are time‑consuming and not user‑friendly. To address this issue, this paper proposes a novel MRT measurement framework that uses visual simultaneous localization and mapping (SLAM) and semantic segmentation techniques. The proposed approach follows the rule of thumb of the traditional MRT calculation method using surface temperature and view factors. However, it employs visual SLAM and creates a 3D thermal point cloud with enriched surface temperature information. The framework then implements Grounded SAM, a new object detection and segmentation tool to extract features with distinct temperature profiles on building surfaces. The detailed segmentation of thermal features not only reduces potential errors in the calculation of the MRT but also provides an efficient reconstruction of the spatial MRT distribution in the indoor environment. We also validate the calculation results with the reference measurement methodology. This data‑driven framework offers faster and more efficient MRT measurements and spatial mapping than conventional methods. It can enable the direct engagement of researchers and practitioners in MRT measurements and contribute to research on thermal comfort and radiant cooling and heating systems.
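
The traditional calculation from surface temperatures and view factors that the abstract refers to is commonly written as T_mrt^4 = sum_i F_i * T_i^4, with temperatures in kelvin. A minimal sketch of that textbook relation follows; it assumes view factors that sum to one and surfaces of equal emissivity, and the function name is hypothetical rather than taken from the paper.

```python
def mean_radiant_temperature(surface_temps_c, view_factors):
    """Mean radiant temperature in deg C from surface temperatures and view factors.

    Uses T_mrt^4 = sum_i F_i * T_i^4 with temperatures in kelvin.
    Assumes the view factors sum to 1 and all surfaces share one emissivity.
    """
    if len(surface_temps_c) != len(view_factors):
        raise ValueError("one view factor is needed per surface")
    if abs(sum(view_factors) - 1.0) > 1e-6:
        raise ValueError("view factors must sum to 1")
    fourth_power = sum(f * (t + 273.15) ** 4
                       for t, f in zip(surface_temps_c, view_factors))
    return fourth_power ** 0.25 - 273.15
```

Finer segmentation of surfaces with distinct temperature profiles corresponds to splitting one (temperature, view factor) pair into several, which is where the framework's segmentation step enters the calculation.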

Abstract:
Many image processing applications rely on partitioning an image into disjoint regions whose pixels are 'similar.' The watershed and waterfall transforms are established mathematical morphology pixel clustering techniques. They are both relevant to modern applications where groups of pixels are to be decided upon in one go, or where adjacency information is relevant. We introduce three new parallel partitioning algorithms for GPUs. By repeatedly applying watershed algorithms, we produce waterfall results which form a hierarchy of partition regions over an input image. Our watershed algorithms attain competitive execution times in both 2D and 3D, processing an 800 megavoxel image in less than 1.4 sec. We also show how to use this fully deterministic image partitioning as a pre‑processing step to machine learning based semantic segmentation. This replaces the role of superpixel algorithms, and results in comparable accuracy and faster training times.

Abstract:
Video segmentation is essential for advancing robotics and autonomous driving, particularly in open‑world settings where continuous perception and object association across video frames are critical. While the Segment Anything Model (SAM) has excelled in static image segmentation, extending its capabilities to video segmentation poses significant challenges. We tackle two major hurdles: a) SAM's embedding limitations in associating objects across frames, and b) granularity inconsistencies in object segmentation. To this end, we introduce VideoSAM, an end‑to‑end framework designed to address these challenges by improving object tracking and segmentation consistency in dynamic environments. VideoSAM integrates an agglomerated backbone, RADIO, enabling object association through similarity metrics and introduces Cycle‑ack‑Pairs Propagation with a memory mechanism for stable object tracking. Additionally, we incorporate an autoregressive object‑token mechanism within the SAM decoder to maintain consistent granularity across frames. Our method is extensively evaluated on the UVO and BURST benchmarks, and robotic videos from RoboTAP, demonstrating its effectiveness and robustness in real‑world scenarios. All codes will be available.

Abstract:
Safe navigation in new environments requires autonomous vehicles and robots to accurately interpret their surroundings, relying on LiDAR scene segmentation, out‑of‑distribution (OOD) obstacle detection, and uncertainty computation. We propose a method to distinguish in‑distribution (ID) from OOD samples and quantify both epistemic and aleatoric uncertainties using the feature space of a single deterministic model. After training a semantic segmentation network, a Gaussian Mixture Model (GMM) is fitted to its feature space. OOD samples are detected by checking if their squared Mahalanobis distances to each Gaussian component conform to a chi‑squared distribution, eliminating the need for an additional OOD training set. Given that the estimated mean and covariance matrix of a multivariate Gaussian distribution follow Gaussian and Inverse‑Wishart distributions, multiple GMMs are generated by sampling from these distributions to assess epistemic uncertainty through classification variability. Aleatoric uncertainty is derived from the entropy of responsibility values within Gaussian components. Comparing our method with deep ensembles and logit‑sampling for uncertainty computation demonstrates its superior performance in real‑world applications for quantifying epistemic and aleatoric uncertainty, as well as detecting OOD samples. While deep ensembles miss some highly uncertain samples, our method successfully detects them and assigns high epistemic uncertainty.
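
The OOD test described above compares squared Mahalanobis distances against a chi-squared distribution. A minimal sketch for a 2-dimensional feature space follows; it is restricted to two dimensions because the chi-squared quantile with 2 degrees of freedom has the closed form -2 ln(1 - p). The rule "OOD if outside the confidence region of every component", the confidence level, and all names are assumptions for illustration.

```python
import math

def mahalanobis_sq_2d(x, mean, cov):
    """Squared Mahalanobis distance for a 2x2 covariance matrix."""
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Inverse of [[a, b], [c, d]] is (1 / det) * [[d, -b], [-c, a]].
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det

def chi2_quantile_2dof(p):
    """Quantile of the chi-squared distribution with 2 degrees of freedom."""
    return -2.0 * math.log(1.0 - p)

def is_ood(x, components, p=0.99):
    """Flag x as OOD if it lies outside the p-region of every Gaussian component."""
    threshold = chi2_quantile_2dof(p)
    return all(mahalanobis_sq_2d(x, mean, cov) > threshold
               for mean, cov in components)
```

No OOD training data appears anywhere in the test: the threshold comes only from the fitted Gaussians and the chosen confidence level.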

Abstract:
Within a perception framework for autonomous mobile and robotic systems, semantic analysis of 3D point clouds typically generated by LiDARs is key to numerous applications, such as object detection and recognition, and scene reconstruction. Scene semantic segmentation can be achieved by directly integrating 3D spatial data with specialized deep neural networks. Although this type of data provides rich geometric information regarding the surrounding environment, it also presents numerous challenges: its unstructured and sparse nature, its unpredictable size, and its demanding computational requirements. These characteristics hinder the real‑time semantic analysis, particularly on resource‑constrained hardware architectures that constitute the main computational components of numerous robotic applications. Therefore, in this paper, we investigate various 3D semantic segmentation methodologies and analyze their performance and capabilities for resource‑constrained inference on embedded NVIDIA Jetson platforms. We evaluate them for a fair comparison through a standardized training protocol and data augmentations, providing benchmark results on the Jetson AGX Orin and AGX Xavier series for two large‑scale outdoor datasets: SemanticKITTI and nuScenes.

Abstract:
Despite alleviating the dependence on dense annotations inherent to fully supervised methods, weakly supervised point cloud semantic segmentation suffers from inadequate supervision signals. In response to this challenge, we introduce a novel perspective that imparts auxiliary constraints by regulating the feature space under weak supervision. Our initial investigation identifies which distributions accurately characterize the feature space, subsequently leveraging this priori to guide the alignment of the weakly supervised embeddings. Specifically, we analyze the superiority of the mixture of von Mises‑Fisher distributions (moVMF) among several common distribution candidates. Accordingly, we develop a Distribution Guidance Network (DGNet), which comprises a weakly supervised learning branch and a distribution alignment branch. Leveraging reliable clustering initialization derived from the weakly supervised learning branch, the distribution alignment branch alternately updates the parameters of the moVMF and the network, ensuring alignment with the moVMF‑defined latent space. Extensive experiments validate the rationality and effectiveness of our distribution choice and network design. Consequently, DGNet achieves state‑of‑the‑art performance under multiple datasets and various weakly supervised settings.

Abstract:
Video segmentation is a popular task, but applying image segmentation models frame‑by‑frame to videos does not preserve temporal consistency. In this paper, we propose a method to extend a query‑based image segmentation model to video using feature shift and query matching. The method uses a query‑based architecture, where decoded queries represent segmentation masks. These queries should be matched before performing the feature shift to ensure that the shifted queries represent the same mask across different frames. Experimental results on CityScapes‑VPS and VSPW show significant improvements from the baselines, highlighting the method's effectiveness in enhancing segmentation quality while efficiently reusing pre‑trained weights.
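
The query-matching step described above can be illustrated with a small sketch that pairs the decoded queries of two frames so that the total cosine similarity is maximal. The abstract does not state the matching criterion or algorithm; cosine similarity and the brute-force search over permutations (practical only for a handful of queries) are assumptions chosen to keep the example self-contained.

```python
import itertools
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def match_queries(prev_queries, curr_queries):
    """Return perm such that prev_queries[i] is matched to curr_queries[perm[i]].

    Exhaustive search over all one-to-one assignments; illustration only.
    """
    n = len(prev_queries)
    best = max(itertools.permutations(range(n)),
               key=lambda perm: sum(cosine(prev_queries[i], curr_queries[perm[i]])
                                    for i in range(n)))
    return list(best)
```

After matching, query `i` in the previous frame and query `perm[i]` in the current frame are taken to describe the same mask, which is the precondition the abstract gives for the feature shift.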

Abstract:
Recent advancements in 3D reconstruction methods and vision‑language models have propelled the development of multi‑modal 3D scene understanding, which has vital applications in robotics, autonomous driving, and virtual/augmented reality. However, current multi‑modal scene understanding approaches have naively embedded semantic representations into 3D reconstruction methods without striking a balance between visual and language modalities, which leads to unsatisfying semantic rasterization of translucent or reflective objects, as well as over‑fitting on color modality. To alleviate these limitations, we propose a solution that adequately handles the distinct visual and semantic modalities, i.e., a 3D vision‑language Gaussian splatting model for scene understanding, to put emphasis on the representation learning of language modality. We propose a novel cross‑modal rasterizer, using modality fusion along with a smoothed semantic indicator for enhancing semantic rasterization. We also employ a camera‑view blending technique to improve semantic consistency between existing and synthesized views, thereby effectively mitigating over‑fitting. Extensive experiments demonstrate that our method achieves state‑of‑the‑art performance in open‑vocabulary semantic segmentation, surpassing existing methods by a significant margin.

Abstract:
Instance segmentation is a core computer vision task with great practical significance. Recent advances, driven by large‑scale benchmark datasets, have yielded good general‑purpose Convolutional Neural Network (CNN)‑based methods. Natural Resource Monitoring (NRM) utilizes remote sensing imagery with generally known scale and containing multiple overlapping instances of the same class, wherein the object contours are jagged and highly irregular. This is in stark contrast with the regular man‑made objects found in classic benchmark datasets. We address this problem and propose a novel instance segmentation method geared towards NRM imagery. We formulate the problem as Bayesian maximum a posteriori inference which, in learning the individual object contours, incorporates shape, location, and position priors from state‑of‑the‑art CNN architectures, driving a simultaneous level‑set evolution of multiple object contours. We employ loose coupling between the CNNs that supply the priors and the active contour process, allowing a drop‑in replacement of new network architectures. Moreover, we introduce a novel prior for contour shape, namely, a class of Deep Shape Models based on architectures from Generative Adversarial Networks (GANs). These Deep Shape Models are in essence a non‑linear generalization of the classic Eigenshape formulation. In experiments, we tackle the challenging, real‑world problem of segmenting individual dead tree crowns and delineating precise contours. We compare our method to two leading general‑purpose instance segmentation methods ‑ Mask R‑CNN and K‑net ‑ on color infrared aerial imagery. Results show our approach to significantly outperform both methods in terms of reconstruction quality of tree crown contours. Furthermore, use of the GAN‑based deep shape model prior yields significant improvement of all results over the vanilla Eigenshape prior.

Abstract:
Through automation, deep learning (DL) can enhance the analysis of transesophageal echocardiography (TEE) images. However, DL methods require large amounts of high‑quality data to produce accurate results, which is difficult to satisfy. Data augmentation is commonly used to tackle this issue. In this work, we develop a pipeline to generate synthetic TEE images and corresponding semantic labels. The proposed data generation pipeline expands on an existing pipeline that generates synthetic transthoracic echocardiography images by transforming slices from anatomical models into synthetic images. We also demonstrate that such images can improve DL network performance through a left‑ventricle semantic segmentation task. For the pipeline's unpaired image‑to‑image (I2I) translation section, we explore two generative methods: CycleGAN and contrastive unpaired translation. Next, we evaluate the synthetic images quantitatively using the Fréchet Inception Distance (FID) Score and qualitatively through a human perception quiz involving expert cardiologists and the average researcher. In this study, we achieve a dice score improvement of up to 10% when we augment datasets with our synthetic images. Furthermore, we compare established methods of assessing unpaired I2I translation and observe a disagreement when evaluating the synthetic images. Finally, we see which metric better predicts the generated data's efficacy when used for data augmentation.

Abstract:
Point cloud semantic segmentation, the process of classifying each point into predefined categories, is essential for 3D scene understanding. While image‑based segmentation is widely adopted due to its maturity, methods relying solely on RGB information often suffer from degraded performance due to color inaccuracies. Recent advancements have incorporated additional features such as intensity and geometric information, yet RGB channels continue to negatively impact segmentation accuracy when errors in colorization occur. Despite this, previous studies have not rigorously quantified the effects of erroneous colorization on segmentation performance. In this paper, we propose a novel statistical approach to evaluate the impact of inaccurate RGB information on image‑based point cloud segmentation. We categorize RGB inaccuracies into two types: incorrect color information and similar color information. Our results demonstrate that both types of color inaccuracies significantly degrade segmentation accuracy, with similar color errors particularly affecting the extraction of geometric features. These findings highlight the critical need to reassess the role of RGB information in point cloud segmentation and its implications for future algorithm design.

Abstract:
Semantic segmentation is a critical technique for effective scene understanding. Traditional RGB‑T semantic segmentation models often struggle to generalize across diverse scenarios due to their reliance on pretrained models and predefined categories. Recent advancements in Visual Language Models (VLMs) have facilitated a shift from closed‑set to open‑vocabulary semantic segmentation methods. However, these models face challenges in dealing with intricate scenes, primarily due to the heterogeneity between RGB and thermal modalities. To address this gap, we present Open‑RGBT, a novel open‑vocabulary RGB‑T semantic segmentation model. Specifically, we obtain instance‑level detection proposals by incorporating visual prompts to enhance category understanding. Additionally, we employ the CLIP model to assess image‑text similarity, which helps correct semantic consistency and mitigates ambiguities in category identification. Empirical evaluations demonstrate that Open‑RGBT achieves superior performance in diverse and challenging real‑world scenarios, even in the wild, significantly advancing the field of RGB‑T semantic segmentation.

Abstract:
Adverse weather conditions pose a significant challenge to the widespread adoption of Autonomous Vehicles (AVs) by impacting sensors like LiDARs and cameras. Even though Collaborative Perception (CP) improves AV perception in difficult conditions, existing CP datasets lack adverse weather conditions. To address this, we introduce Adver‑City, the first open‑source synthetic CP dataset focused on adverse weather conditions. Simulated in CARLA with OpenCDA, it contains over 24 thousand frames, over 890 thousand annotations, and 110 unique scenarios across six different weather conditions: clear weather, soft rain, heavy rain, fog, foggy heavy rain and, for the first time in a synthetic CP dataset, glare. It has six object categories including pedestrians and cyclists, and uses data from vehicles and roadside units featuring LiDARs, RGB and semantic segmentation cameras, GNSS, and IMUs. Its scenarios, based on real crash reports, depict the most relevant road configurations for adverse weather and poor visibility conditions, varying in object density, with both dense and sparse scenes, allowing for novel testing conditions of CP models. Benchmarks run on the dataset show that weather conditions created challenging conditions for perception models, with CoBEVT scoring 58.30/52.44/38.90 (AP@30/50/70). The dataset, code and documentation are available at https://labs.cs.queensu.ca/quarrg/datasets/adver‑city/.

Abstract:
This paper presents UNCOM, a novel hybrid framework for interpreting natural human commands in tabletop scenarios. The system integrates multiple sources of information ‑‑ speech, gestures, and scene context ‑‑ to extract structured, actionable instructions for robots. Addressing the need for general‑purpose human‑robot interaction in domestic environments, UNCOM is designed for zero‑shot operation, without reliance on predefined object models or training data specific to a given task. Using foundational and task‑specific deep learning models, it allows out‑of‑the‑box speech recognition, natural language understanding, gesture detection, and object segmentation. The modular architecture enhances transparency and explainability by explicitly parsing commands into object‑action‑target representations, enabling integration with symbolic robotic frameworks. We demonstrate the system in a TIAGo++ robot and provide an evaluation on a real‑world data set of human‑robot interaction scenarios; achieving an 82.39% success rate over our benchmark data set, highlighting the robustness of the system to diversity, noise, and communication ambiguity. The data set, evaluation scenarios, and the code are publicly available to support future research.

Abstract:
Existing perception models achieve great success by learning from large amounts of labeled data, but they still struggle with open‑world scenarios. To alleviate this issue, researchers have introduced open‑set perception tasks to detect or segment objects unseen in the training set. However, these models require predefined object categories as inputs during inference, which are not available in real‑world scenarios. Recently, researchers have posed a new and more practical problem, i.e., open‑ended object detection, which discovers unseen objects without any object categories as inputs. In this paper, we present VL‑SAM, a training‑free framework that combines the generalized object recognition model (i.e., Vision‑Language Model) with the generalized object localization model (i.e., Segment‑Anything Model), to address the open‑ended object detection and segmentation task. Without additional training, we connect these two generalized models with attention maps as the prompts. Specifically, we design an attention map generation module by employing head aggregation and a regularized attention flow to aggregate and propagate attention maps across all heads and layers in the VLM, yielding high‑quality attention maps. Then, we iteratively sample positive and negative points from the attention maps with a prompt generation module and send the sampled points to SAM to segment corresponding objects. Experimental results on the long‑tail instance segmentation dataset (LVIS) show that our method surpasses the previous open‑ended method on the object detection task and can provide additional instance segmentation masks. In addition, VL‑SAM achieves favorable performance on the corner case object detection dataset (CODA), demonstrating the effectiveness of VL‑SAM in real‑world applications. Moreover, VL‑SAM exhibits good model generalization that can incorporate various VLMs and SAMs.

Abstract:
With the development of steel materials, metallographic analysis has become increasingly important. Unfortunately, grain size analysis is a manual process that requires experts to evaluate metallographic photographs, which is unreliable and time‑consuming. To resolve this problem, we propose a novel classification method based on deep learning, namely GSNets, a family of hybrid models which can effectively introduce guided self‑attention for classifying grain size. Concretely, we build our models from three insights: (1) Introducing our novel guided self‑attention module can assist the model in finding the generalized necessarily distinct vectors capable of retaining intricate relational connections and rich local feature information; (2) By improving the pixel‑wise linear independence of the feature map, the highly condensed semantic representation will be captured by the model; (3) Our novel triple‑stream merging module can significantly improve the generalization capability and efficiency of the model. Experiments show that our GSNet yields a classification accuracy of 90.1%, surpassing the state‑of‑the‑art Swin Transformer V2 by 1.9% on the steel grain size dataset, which comprises 3,599 images with 14 grain size levels. Furthermore, we intuitively believe our approach is applicable to broader applications like object detection and semantic segmentation.

Abstract:
This research addresses the need for high‑definition (HD) maps for autonomous vehicles (AVs), focusing on road lane information derived from aerial imagery. While Earth observation data offers valuable resources for map creation, specialized models for road lane extraction are still underdeveloped in remote sensing. In this study, we perform an extensive comparison of twelve foundational deep learning‑based semantic segmentation models for road lane marking extraction from high‑definition remote sensing images, assessing their performance under transfer learning with partially labeled datasets. These models were fine‑tuned on the partially labeled Waterloo Urban Scene dataset, and pre‑trained on the SkyScapes dataset, simulating a likely scenario of real‑life model deployment under partial labeling. We observed and assessed the fine‑tuning performance and overall performance. Models showed significant performance improvements after fine‑tuning, with mean IoU scores ranging from 33.56% to 76.11%, and recall ranging from 66.0% to 98.96%. Transformer‑based models outperformed convolutional neural networks, emphasizing the importance of model pre‑training and fine‑tuning in enhancing HD map development for AV navigation.
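
For reference, the mean IoU reported above is conventionally computed per class as intersection over union and then averaged. The sketch below works on flattened label lists; skipping classes that appear in neither prediction nor ground truth is a common convention assumed here, not a detail taken from the paper.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target."""
    ious = []
    for cls in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union:  # ignore classes absent from both label maps
            ious.append(inter / union)
    if not ious:
        raise ValueError("no class is present in either label map")
    return sum(ious) / len(ious)
```

For lane markings, which occupy few pixels, a small number of misclassified pixels moves the foreground IoU far more than the background IoU, which is one reason the per-class average is preferred over pixel accuracy.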

Abstract:
As remote sensing imaging technology continues to advance and evolve, processing high‑resolution and diversified satellite imagery to improve segmentation accuracy and enhance interpretation efficiency emerges as a pivotal area of investigation within the realm of remote sensing. Although segmentation algorithms based on CNNs and Transformers achieve significant progress in performance, balancing segmentation accuracy and computational complexity remains challenging, limiting their wide application in practical tasks. To address this, this paper introduces the state space model (SSM) and proposes a novel hybrid semantic segmentation network based on vision Mamba (CVMH‑UNet). This method designs a cross‑scanning visual state space block (CVSSBlock) that uses cross 2D scanning (CS2D) to fully capture global information from multiple directions. By incorporating convolutional neural network branches to overcome the constraints of Vision Mamba (VMamba) in acquiring local information, the approach facilitates a comprehensive analysis of both global and local features. Furthermore, to address the issue of limited discriminative power and the difficulty in achieving detailed fusion with direct skip connections, a multi‑frequency multi‑scale feature fusion block (MFMSBlock) is designed. This module introduces multi‑frequency information through 2D discrete cosine transform (2D DCT) to enhance information utilization and provides additional scale local detail information through point‑wise convolution branches. Finally, it aggregates multi‑scale information along the channel dimension, achieving refined feature fusion. Findings from experiments conducted on renowned datasets of remote sensing imagery demonstrate that the proposed CVMH‑UNet achieves superior segmentation performance while maintaining low computational complexity, surpassing current leading‑edge segmentation algorithms.

Abstract:
Deep segmentation networks achieve high performance when trained on specific datasets. However, in clinical practice, it is often desirable that pretrained segmentation models can be dynamically extended to enable segmenting new organs without access to previous training datasets or without training from scratch. This would ensure a much more efficient model development and deployment paradigm accounting for the patient privacy and data storage issues. This clinically preferred process can be viewed as a continual semantic segmentation (CSS) problem. Previous CSS works would either experience catastrophic forgetting or lead to unaffordable memory costs as models expand. In this work, we propose a new continual whole‑body organ segmentation model with light‑weighted low‑rank adaptation (LoRA). We first train and freeze a pyramid vision transformer (PVT) base segmentation model on the initial task, then continually add light‑weighted trainable LoRA parameters to the frozen model for each new learning task. Through a holistic exploration of the architecture modification, we identify the three most important layers (i.e., patch‑embedding, multi‑head attention and feed forward layers) that are critical in adapting to the new segmentation tasks, while keeping the majority of the pretrained parameters fixed. Our proposed model continually segments new organs without catastrophic forgetting while maintaining a low parameter increasing rate. Continually trained and tested on four datasets covering different body parts of a total of 121 organs, results show that our model achieves high segmentation accuracy, closely reaching the PVT and nnUNet upper bounds, and significantly outperforms other regularization‑based CSS methods. When compared to the leading architecture‑based CSS method, our model has a substantially lower parameter increasing rate while achieving comparable performance.
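
The mechanism of adding light-weight trainable parameters to a frozen layer can be sketched with the standard LoRA update y = W x + (alpha / r) * B (A x), where only A and B are trained. The sketch below uses plain lists; the shapes, the `alpha` scaling, and the function names are generic LoRA conventions assumed here and are not taken from the paper's code.

```python
def matvec(matrix, vector):
    """Matrix-vector product on plain Python lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def lora_forward(x, w_frozen, lora_a, lora_b, alpha=1.0):
    """Frozen linear layer plus a low-rank update.

    w_frozen: d_out x d_in (never updated)
    lora_a:   r x d_in,  lora_b: d_out x r (the only trainable parts)
    """
    rank = len(lora_a)
    base = matvec(w_frozen, x)
    update = matvec(lora_b, matvec(lora_a, x))
    return [b + (alpha / rank) * u for b, u in zip(base, update)]

def lora_param_count(d_in, d_out, rank):
    """Trainable parameters added per adapted layer."""
    return rank * (d_in + d_out)
```

With B initialized to zero the adapted layer reproduces the frozen model exactly, and each new task adds only `rank * (d_in + d_out)` parameters per adapted layer instead of `d_in * d_out`, which is the source of the low parameter increasing rate.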

Abstract:
Numerous studies have demonstrated the strong performance of Vision Transformer (ViT)‑based methods across various computer vision tasks. However, ViT models often struggle to effectively capture high‑frequency components in images, which are crucial for detecting small targets and preserving edge details, especially in complex scenarios. This limitation is particularly challenging in colon polyp segmentation, where polyps exhibit significant variability in structure, texture, and shape. High‑frequency information, such as boundary details, is essential for achieving precise semantic segmentation in this context. To address these challenges, we propose HiFiSeg, a novel network for colon polyp segmentation that enhances high‑frequency information processing through a global‑local vision transformer framework. HiFiSeg leverages the pyramid vision transformer (PVT) as its encoder and introduces two key modules: the global‑local interaction module (GLIM) and the selective aggregation module (SAM). GLIM employs a parallel structure to fuse global and local information at multiple scales, effectively capturing fine‑grained features. SAM selectively integrates boundary details from low‑level features with semantic information from high‑level features, significantly improving the model's ability to accurately detect and segment polyps. Extensive experiments on five widely recognized benchmark datasets demonstrate the effectiveness of HiFiSeg for polyp segmentation. Notably, the mDice scores on the challenging CVC‑ColonDB and ETIS datasets reached 0.826 and 0.822, respectively, underscoring the superior performance of HiFiSeg in handling the specific complexities of this task.
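
The mDice figures quoted above average the per-image Dice coefficient, 2|A ∩ B| / (|A| + |B|), over a dataset. A minimal sketch on flattened binary masks follows; returning 1.0 when both masks are empty is a convention assumed here.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat 0/1 lists."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # both masks empty: treated as perfect agreement (assumption)
    return 2.0 * intersection / total

def mean_dice(pred_masks, target_masks):
    """Average Dice over paired masks."""
    scores = [dice_score(p, t) for p, t in zip(pred_masks, target_masks)]
    return sum(scores) / len(scores)
```

Because the denominator is the sum of the two mask sizes, boundary errors on a small polyp cost proportionally more than on a large one, which is why boundary detail matters for this metric.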

Abstract:
3D instance segmentation is crucial for obtaining an understanding of a point cloud scene. This paper presents a novel neural network architecture for performing instance segmentation on 3D point clouds. We propose to jointly learn coefficients and prototypes in parallel, which can be combined to obtain the instance predictions. The coefficients are computed using an overcomplete set of sampled points with a novel multi‑scale module, dubbed dilated point inception. As the set of obtained instance mask predictions is overcomplete, we employ a non‑maximum suppression algorithm to retrieve the final predictions. This approach makes it possible to omit the time‑consuming clustering step and leads to a more stable inference time. The proposed method is not only 28% faster than the state of the art but also exhibits the lowest standard deviation. Our experiments show that the standard deviation of the inference time is only 1.0% of the total time, whereas it ranges between 10.8% and 53.1% for state‑of‑the‑art methods. Lastly, our method outperforms the state of the art on both S3DIS‑blocks (4.9% in mRec on Fold‑5) and PartNet (2.0% on average in mAP).
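
The combination of prototypes and coefficients followed by non‑maximum suppression can be sketched as follows. This is a hedged illustration only: the linear combination with a sigmoid, the thresholds, and the function names `combine`, `mask_iou`, and `mask_nms` are assumptions, since the abstract does not specify how the two outputs are merged.

```python
import math


def combine(prototypes, coefficients, threshold=0.5):
    """Build one binary mask per candidate.

    prototypes:   K lists, each holding one value per point (length N).
    coefficients: M lists, each holding K weights (one per prototype).
    Assumption: a mask is the sigmoid of the weighted prototype sum.
    """
    masks = []
    for coeff in coefficients:
        mask = []
        for i in range(len(prototypes[0])):
            logit = sum(c * proto[i] for c, proto in zip(coeff, prototypes))
            mask.append(1.0 / (1.0 + math.exp(-logit)) > threshold)
        masks.append(mask)
    return masks


def mask_iou(a, b):
    """Intersection over union of two boolean point masks."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return inter / union if union else 0.0


def mask_nms(masks, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best-scoring mask, drop overlapping duplicates."""
    order = sorted(range(len(masks)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(mask_iou(masks[i], masks[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

Because every sampled point yields a candidate mask, many candidates describe the same instance; the greedy suppression step replaces a separate clustering stage.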

Abstract:
Recently, integrating the local modeling capabilities of Convolutional Neural Networks (CNNs) with the global dependency strengths of Transformers has created a sensation in the semantic segmentation community. However, substantial computational workloads and high hardware memory demands remain major obstacles to their further application in real‑time scenarios. In this work, we propose a Lightweight Multiple‑Information Interaction Network (LMIINet) for real‑time semantic segmentation, which effectively combines CNNs and Transformers while reducing redundant computations and memory footprints. It features Lightweight Feature Interaction Bottleneck (LFIB) modules comprising efficient convolutions that enhance context integration. Additionally, improvements are made to the Flatten Transformer by enhancing local and global feature interaction to capture detailed semantic information. Incorporating a combination coefficient learning scheme in both LFIB and Transformer blocks facilitates improved feature interaction. Extensive experiments demonstrate that LMIINet excels in balancing accuracy and efficiency. With only 0.72M parameters and 11.74G FLOPs (floating‑point operations), LMIINet achieves 72.0% mIoU (mean Intersection over Union) at 100 FPS (frames per second) on the Cityscapes test set and 69.94% mIoU at 160 FPS on the CamVid test dataset using a single RTX2080Ti GPU.
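
For readers unfamiliar with the mIoU metric reported above, a minimal sketch of its computation on a single flattened prediction is shown below. The function name `mean_iou` and the choice to skip classes absent from both prediction and ground truth are assumptions; benchmark toolkits typically accumulate intersections and unions over the whole test set before averaging.

```python
def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union for flat lists of per-pixel class ids."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # ignore classes that appear in neither pred nor target
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0
```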

Abstract:
Melanoma segmentation in Whole Slide Images (WSIs) is useful for prognosis and the measurement of crucial prognostic factors such as Breslow depth and primary invasive tumor size. In this paper, we present a novel approach that uses the Segment Anything Model (SAM) for automatic melanoma segmentation in microscopy slide images. Our method employs an initial semantic segmentation model to generate preliminary segmentation masks that are then used to prompt SAM. We design a dynamic prompting strategy that uses a combination of centroid and grid prompts to achieve optimal coverage of the super high‑resolution slide images while maintaining the quality of generated prompts. To optimize for invasive melanoma segmentation, we further refine the prompt generation process by implementing in‑situ melanoma detection and low‑confidence region filtering. We select Segformer as the initial segmentation model and EfficientSAM as the segment anything model for parameter‑efficient fine‑tuning. Our experimental results demonstrate that this approach not only surpasses other state‑of‑the‑art melanoma segmentation methods but also significantly outperforms the baseline Segformer by 9.1% in terms of IoU.
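
A rough sketch of how centroid and grid point prompts could be derived from a preliminary binary mask is given below. The switching rule in `dynamic_prompts` (use grid points when enough fall on the foreground, otherwise fall back to the centroid) and all names and parameters are hypothetical; the abstract states only that the two prompt types are combined.

```python
def centroid_prompt(mask):
    """Return the (row, col) centroid of foreground pixels, or None if empty."""
    pts = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not pts:
        return None
    return (sum(r for r, _ in pts) / len(pts), sum(c for _, c in pts) / len(pts))


def grid_prompts(mask, stride):
    """Regular grid points that land on foreground pixels of the mask."""
    return [(r, c)
            for r in range(stride // 2, len(mask), stride)
            for c in range(stride // 2, len(mask[0]), stride)
            if mask[r][c]]


def dynamic_prompts(mask, stride, min_grid_points=2):
    """Hypothetical rule: grid prompts for large regions, centroid otherwise."""
    grid = grid_prompts(mask, stride)
    if len(grid) >= min_grid_points:
        return grid
    centroid = centroid_prompt(mask)
    return [centroid] if centroid is not None else []
```

The resulting points would then be passed as positive point prompts to the promptable segmentation model, one tile of the slide at a time.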

Abstract:
This paper introduces a new approach to extract and analyze vector data from technical drawings in PDF format. Our method involves converting PDF files into SVG format and creating a feature‑rich graph representation, which captures the relationships between vector entities using geometrical information. We then apply a graph attention transformer with hierarchical label definition to achieve accurate line‑level segmentation. Our approach is evaluated on two datasets, including the public FloorplanCAD dataset, on which it achieves state‑of‑the‑art results in weighted F1 score, surpassing existing methods. The proposed vector‑based method offers a more scalable solution for large‑scale technical drawing analysis compared to vision‑based approaches, while also requiring significantly less GPU power than current state‑of‑the‑art vector‑based techniques. Moreover, it demonstrates improved performance in terms of the weighted F1 (wF1) score on the semantic segmentation task. Our results demonstrate the effectiveness of our approach in extracting meaningful information from technical drawings, enabling new applications, and improving existing workflows in the AEC industry. Potential applications of our approach include automated building information modeling (BIM) and construction planning, which could significantly impact the efficiency and productivity of the industry.

Abstract:
Few‑shot medical image segmentation (FSMIS) aims to learn from limited annotated data in medical image analysis. Despite the progress that has been achieved, current FSMIS models are all trained and deployed on the same data domain, which is inconsistent with the clinical reality that medical imaging data always spans different data domains (e.g., imaging modalities, institutions and equipment sequences). How can FSMIS models be enhanced to generalize well across different medical imaging domains? In this paper, we focus on the matching mechanism of few‑shot semantic segmentation models and introduce a domain‑robust matching mechanism based on Earth Mover's Distance (EMD) calculation for the cross‑domain scenario. Specifically, we formulate the EMD transportation process between the foreground support and query features. A texture‑structure‑aware weight generation method, which performs Sobel‑based image gradient calculation over the nodes, is introduced into the EMD matching flow to suppress domain‑relevant nodes. In addition, a point‑set‑level distance metric is introduced to calculate the cost of transportation from support set nodes to query set nodes. To evaluate the performance of our model, we conduct experiments on three scenarios (i.e., cross‑modal, cross‑sequence and cross‑institution), which include eight medical datasets and involve three body regions, and the results demonstrate that our model achieves SoTA performance against the compared models.

Abstract:
The escalating use of Unmanned Aerial Vehicles (UAVs) as remote sensing platforms has garnered considerable attention, proving invaluable for ground object recognition. While satellite remote sensing images face limitations in resolution and weather susceptibility, UAV remote sensing, employing low‑speed unmanned aircraft, offers enhanced object resolution and agility. The advent of advanced machine learning techniques has propelled significant strides in image analysis, particularly in semantic segmentation for UAV remote sensing images. This paper evaluates the effectiveness and efficiency of SegFormer, a semantic segmentation framework, for the semantic segmentation of UAV images. SegFormer variants, ranging from real‑time (B0) to high‑performance (B5) models, are assessed using the UAVid dataset tailored for semantic segmentation tasks. The research details the architecture and training procedures specific to SegFormer in the context of UAV semantic segmentation. Experimental results showcase the model's performance on the benchmark dataset, highlighting its ability to accurately delineate objects and land cover features in diverse UAV scenarios, leading to both high efficiency and performance.

Abstract:
Robot‑based smart pharmacies are essential for modern healthcare systems, enabling efficient drug delivery. However, a critical challenge exists in the robotic handling of drugs with varying shapes and overlapping positions, which previous studies have not adequately addressed. To enhance the robotic arm's ability to grasp chaotic, overlapping, and variously shaped drugs, this paper proposes a novel framework combining a multi‑stage grasping network with an adaptive robotics mechanism. The framework first preprocesses images using an improved Super‑Resolution Convolutional Neural Network (SRCNN) algorithm, and then employs the proposed YOLOv5+E‑A‑SPPFCSPC+BIFPNC (YOLO‑EASB) instance segmentation algorithm for precise drug segmentation. The most suitable drugs for grasping are determined by assessing the completeness of the segmentation masks. These segmented drugs are then processed by our improved Adaptive Feature Fusion and Grasp‑Aware Network (IAFFGA‑Net) with an optimized loss function, which ensures accurate picking actions even in complex environments. To control the robot grasping, a time‑optimal robotic arm trajectory planning algorithm that combines an improved ant colony algorithm with 3‑5‑3 interpolation is developed, further improving efficiency while ensuring smooth trajectories. Finally, the system is implemented and validated within an adaptive collaborative robot setup, which dynamically adjusts to different production environments and task requirements. Experimental results demonstrate the superiority of our multi‑stage grasping network in optimizing smart pharmacy operations, while also showcasing its remarkable adaptability and effectiveness in practical applications.

Abstract:
Efficient point cloud (PC) compression is crucial for streaming applications, such as augmented reality and cooperative perception. Classic PC compression techniques encode all the points in a frame. Tailoring compression towards perception tasks at the receiver side, we ask the question, "Can we remove the ground points during transmission without sacrificing the detection performance?" Our study reveals that state‑of‑the‑art (SOTA) 3D object detection models depend strongly on the ground, especially on the points below and around the object. In this work, we propose a lightweight obstacle‑aware Pillar‑based Ground Removal (PGR) algorithm. PGR filters out ground points that do not provide context for object recognition, significantly improving the compression ratio without sacrificing receiver‑side perception performance. Because it does not use heavy object detection or semantic segmentation models, PGR is lightweight, highly parallelizable, and effective. Our evaluations on KITTI and Waymo Open Dataset show that SOTA detection models work equally well with PGR removing 20‑30% of the points at a speed of 86 FPS.
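
One way to picture a pillar‑based, obstacle‑aware filter is sketched below. This is not the PGR algorithm itself: the rule used here (a pillar counts as an obstacle when its vertical extent exceeds a threshold, and points are kept only in obstacle pillars and their eight neighbours) and the parameters `pillar_size` and `height_threshold` are assumptions chosen to mirror the observation that ground points below and around objects matter.

```python
from collections import defaultdict


def pillar_ground_removal(points, pillar_size=1.0, height_threshold=0.3):
    """Drop ground points that are far from any obstacle pillar.

    points: iterable of (x, y, z) tuples.
    Returns the list of points that are kept.
    """
    # Group points into vertical columns (pillars) on the x-y plane.
    pillars = defaultdict(list)
    for p in points:
        key = (int(p[0] // pillar_size), int(p[1] // pillar_size))
        pillars[key].append(p)

    # Assumption: a pillar holds an obstacle if its points span enough height.
    obstacle = {
        key for key, pts in pillars.items()
        if max(p[2] for p in pts) - min(p[2] for p in pts) > height_threshold
    }

    kept = []
    for (ix, iy), pts in pillars.items():
        near_obstacle = any((ix + dx, iy + dy) in obstacle
                            for dx in (-1, 0, 1) for dy in (-1, 0, 1))
        if near_obstacle:
            kept.extend(pts)  # obstacle points plus surrounding ground context
    return kept
```

Each pillar is processed independently after the grouping step, which is what makes this family of filters easy to parallelize.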

Abstract:
Capturing real‑world 3D spaces as point clouds is efficient and descriptive, but it comes with sensor errors and lacks object parametrization. These limitations render point clouds unsuitable for various real‑world applications, such as robot programming, without extensive post‑processing (e.g., outlier removal, semantic segmentation). On the other hand, CAD modeling provides high‑quality, parametric representations of 3D space with embedded semantic data, but requires manual component creation that is time‑consuming and costly. To address these challenges, we propose a novel solution that combines the strengths of both approaches. Our method for 3D workcell sketching from point clouds allows users to refine raw point clouds using an Augmented Reality (AR) interface that leverages their knowledge and the real‑world 3D environment. By utilizing a toolbox and an AR‑enabled pointing device, users can enhance point cloud accuracy based on the device's position in 3D space. We validate our approach by comparing it with ground truth models, demonstrating that it achieves a mean error within 1 cm, a significant improvement over standard LiDAR scanner apps.

Abstract:
Accurate semantic segmentation of remote sensing imagery is critical for various Earth observation applications, such as land cover mapping, urban planning, and environmental monitoring. However, individual data sources often present limitations for this task. Very High Resolution (VHR) aerial imagery provides rich spatial details but cannot capture temporal information about land cover changes. Conversely, Satellite Image Time Series (SITS) capture temporal dynamics, such as seasonal variations in vegetation, but with limited spatial resolution, making it difficult to distinguish fine‑scale objects. This paper proposes a late fusion deep learning model (LF‑DLM) for semantic segmentation that leverages the complementary strengths of both VHR aerial imagery and SITS. The proposed model consists of two independent deep learning branches. One branch integrates detailed textures from aerial imagery captured by UNetFormer with a Multi‑Axis Vision Transformer (MaxViT) backbone. The other branch captures complex spatio‑temporal dynamics from the Sentinel‑2 satellite image time series using a U‑Net with Temporal Attention Encoder (U‑TAE). This approach leads to state‑of‑the‑art results on the FLAIR dataset, a large‑scale benchmark for land cover segmentation using multi‑source optical imagery. The findings highlight the importance of multi‑modality fusion in improving the accuracy and robustness of semantic segmentation in remote sensing applications.

Abstract:
Autonomous racing demands safe control of vehicles at their physical limits for extended periods of time, providing insights into advanced vehicle safety systems which increasingly rely on intervention provided by vehicle autonomy. Participation in this field carries with it a high barrier to entry. Physical platforms and their associated sensor suites require large capital outlays before any demonstrable progress can be made. Simulators allow researchers to develop soft autonomous systems without purchasing a platform. However, currently available simulators lack visual and dynamic fidelity, can still be expensive to buy, lack customisation, and are difficult to use. AARK provides three packages, ACI, ACDG, and ACMPC. These packages enable research into autonomous control systems in the demanding environment of racing to bring more people into the field and improve reproducibility: ACI provides researchers with a computer vision‑friendly interface to Assetto Corsa for convenient comparison and evaluation of autonomous control solutions; ACDG enables generation of depth, normal and semantic segmentation data for training computer vision models to use in perception systems; and ACMPC gives newcomers to the field a modular full‑stack autonomous control solution, capable of controlling vehicles, to build from. AARK aims to unify and democratise research into a field critical to providing safer roads and trusted autonomous systems.

Abstract:
Scene sketch semantic segmentation is a crucial task for various applications including sketch‑to‑image retrieval and scene understanding. Existing sketch segmentation methods treat sketches as bitmap images, leading to the loss of temporal order among strokes due to the shift from vector to image format. Moreover, these methods struggle to segment objects from categories absent in the training data. In this paper, we propose a Class‑Agnostic Visio‑Temporal Network (CAVT) for scene sketch semantic segmentation. CAVT employs a class‑agnostic object detector to detect individual objects in a scene and groups the strokes of instances through its post‑processing module. This is the first approach that performs segmentation at both the instance and stroke levels within scene sketches. Furthermore, there is a lack of free‑hand scene sketch datasets with both instance and stroke‑level class annotations. To fill this gap, we collected the largest Free‑hand Instance‑ and Stroke‑level Scene Sketch Dataset (FrISS) that contains 1K scene sketches and covers 403 object classes with dense annotations. Extensive experiments on FrISS and other datasets demonstrate the superior performance of our method over state‑of‑the‑art scene sketch segmentation models. The code and dataset will be made public after acceptance.

Abstract:
Convolutional neural networks (CNNs) achieve prevailing results in segmentation tasks nowadays and represent the state‑of‑the‑art for image‑based analysis. However, the exact decision‑making process of a CNN is still poorly understood. The research area of explainable artificial intelligence (xAI) primarily revolves around understanding and interpreting this black‑box behavior. One way of interpreting a CNN is the use of class activation maps (CAMs), which are heatmaps that indicate the importance of image areas for the prediction of the CNN. For classification tasks, a variety of CAM algorithms exist. For segmentation tasks, however, only one CAM algorithm for the interpretation of the output of a CNN exists. We propose a transfer between existing classification‑ and segmentation‑based methods for more detailed, explainable, and consistent results which show salient pixels in semantic segmentation tasks. The resulting Seg‑HiRes‑Grad CAM is an extension of the segmentation‑based Seg‑Grad CAM with the transfer to the classification‑based HiRes CAM. Our method improves the previously mentioned existing segmentation‑based method by adjusting it to recently published classification‑based methods. Especially for medical image segmentation, this transfer solves existing explainability disadvantages.
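
The difference between the two families of CAMs mentioned above can be sketched on precomputed arrays. The sketch assumes the commonly cited formulations: Grad‑CAM weights each activation channel by its spatially averaged gradient, while HiRes CAM multiplies gradients and activations element‑wise before summing over channels. For the segmentation variants, the gradients would be taken with respect to the class scores summed over the pixels of interest. The function names and the omission of a final ReLU in `hires_cam` are choices made for this illustration.

```python
def grad_cam(activations, gradients):
    """Grad-CAM style map.

    activations, gradients: K channels, each an H x W list of lists.
    Each channel is weighted by the spatial mean of its gradient.
    """
    h, w = len(activations[0]), len(activations[0][0])
    cam = [[0.0] * w for _ in range(h)]
    for act, grad in zip(activations, gradients):
        alpha = sum(sum(row) for row in grad) / (h * w)
        for r in range(h):
            for c in range(w):
                cam[r][c] += alpha * act[r][c]
    return [[max(v, 0.0) for v in row] for row in cam]  # ReLU


def hires_cam(activations, gradients):
    """HiRes-CAM style map: element-wise gradient * activation, summed over K."""
    h, w = len(activations[0]), len(activations[0][0])
    cam = [[0.0] * w for _ in range(h)]
    for act, grad in zip(activations, gradients):
        for r in range(h):
            for c in range(w):
                cam[r][c] += grad[r][c] * act[r][c]
    return cam
```

In the test case of a uniform activation with a gradient concentrated at one pixel, the averaged weights spread importance over the whole map, whereas the element‑wise product keeps it at that pixel.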

Abstract:
Data augmentation is one of the most common tools in deep learning, underpinning many recent advances including tasks such as classification, detection, and semantic segmentation. The standard approach to data augmentation involves simple transformations like rotation and flipping to generate new images. However, these new images often lack diversity along the main semantic dimensions within the data. Traditional data augmentation methods cannot alter high‑level semantic attributes such as the presence of vehicles, trees, and buildings in a scene to enhance data diversity. In recent years, the rapid development of generative models has injected new vitality into the field of data augmentation. In this paper, we address the lack of diversity in data augmentation for the road detection task by using a pre‑trained text‑to‑image diffusion model to parameterize image‑to‑image transformations. Our method involves editing images using these diffusion models to change their semantics. In essence, we achieve this goal by erasing instances of real objects from the original dataset and generating new instances with similar semantics in the erased regions using the diffusion model, thereby expanding the original dataset. We evaluate our approach on the KITTI road dataset and achieve the best results compared to other data augmentation methods, which demonstrates the effectiveness of our proposed method.

Abstract:
In the woodworking industry, a huge amount of effort has to be invested into the initial quality assessment of the raw material. In this study we present an AI model to detect, quantify and localize defects on wooden logs. This model aims to both automate the quality control process and provide a more consistent and reliable quality assessment. For this purpose a dataset of 1424 sample images of wood logs is created. A total of 5 annotators possessing different levels of expertise is involved in dataset creation. An inter‑annotator agreement analysis is conducted to analyze the impact of expertise on the annotation task and to highlight subjective differences in annotator judgement. We explore, train and fine‑tune the state‑of‑the‑art InternImage and ONE‑PEACE architectures for semantic segmentation. The best model created achieves an average IoU of 0.71, and shows detection and quantification capabilities close to the human annotators.

Abstract:
Unsupervised instance segmentation aims to segment distinct object instances in an image without relying on human‑labeled data. This field has recently seen significant advancements, partly due to the strong local correspondences afforded by rich visual feature representations from self‑supervised models (e.g., DINO). Recent state‑of‑the‑art approaches use self‑supervised features to represent images as graphs and solve a generalized eigenvalue system (i.e., normalized‑cut) to generate foreground masks. While effective, this strategy is limited by its attendant computational demands, leading to slow inference speeds. In this paper, we propose Prompt and Merge (ProMerge), which leverages self‑supervised visual features to obtain initial groupings of patches and applies a strategic merging to these segments, aided by a sophisticated background‑based mask pruning technique. ProMerge not only yields competitive results but also offers a significant reduction in inference time compared to state‑of‑the‑art normalized‑cut‑based approaches. Furthermore, when training an object detector using our mask predictions as pseudo‑labels, the resulting detector surpasses the current leading unsupervised model on various challenging instance segmentation benchmarks.

Abstract:
The successful deployment of deep learning‑based techniques for autonomous systems is highly dependent on the data availability for the respective system in its deployment environment. Especially for unstructured outdoor environments, very few datasets exist for even fewer robotic platforms and scenarios. In an earlier work, we presented the German Outdoor and Offroad Dataset (GOOSE) framework along with 10000 multimodal frames from an offroad vehicle to enhance the perception capabilities in unstructured environments. In this work, we address the generalizability of the GOOSE framework. To accomplish this, we open‑source the GOOSE‑Ex dataset, which contains an additional 5000 labeled multimodal frames from various completely different environments, recorded on a robotic excavator and a quadruped platform. We perform a comprehensive analysis of the semantic segmentation performance on different platforms and sensor modalities in unseen environments. In addition, we demonstrate how the combined datasets can be utilized for different downstream applications or competitions such as offroad navigation, object manipulation or scene completion. The dataset, its platform documentation and pre‑trained state‑of‑the‑art models for offroad perception will be made available on https://goose‑dataset.de/.

Abstract:
The human brain exhibits a strong ability to spontaneously associate different visual attributes of the same or similar visual scene, such as associating sketches and graffiti with real‑world visual objects, usually without supervising information. In contrast, in the field of artificial intelligence, controllable generation methods like ControlNet heavily rely on annotated training datasets such as depth maps, semantic segmentation maps, and poses, which limits the method's scalability. Inspired by the neural mechanisms that may contribute to the brain's associative power, specifically the cortical modularization and hippocampal pattern completion, here we propose a self‑supervised controllable generation (SCG) framework. Firstly, we introduce an equivariant constraint to promote inter‑module independence and intra‑module correlation in a modular autoencoder network, thereby achieving functional specialization. Subsequently, based on these specialized modules, we employ a self‑supervised pattern completion approach for controllable generation training. Experimental results demonstrate that the proposed modular autoencoder effectively achieves functional specialization, including the modular processing of color, brightness, and edge detection, and exhibits brain‑like features including orientation selectivity, color antagonism, and center‑surround receptive fields. Through self‑supervised training, associative generation capabilities spontaneously emerge in SCG, demonstrating excellent generalization ability to various tasks such as associative generation on painting, sketches, and ancient graffiti. Compared to the previous representative method ControlNet, our proposed approach not only demonstrates superior robustness in more challenging high‑noise scenarios but also possesses more promising scalability potential due to its self‑supervised manner. Code is released on GitHub and Gitee.

Abstract:
This paper presents a novel weakly supervised semantic segmentation method for radar segmentation, where the existing LiDAR semantic segmentation models are employed to generate semantic labels, which then serve as supervision signals for training a radar semantic segmentation model. The obtained radar semantic segmentation model outperforms LiDAR‑based models, providing more consistent and robust segmentation under all‑weather conditions, particularly in snow, rain and fog. To mitigate potential errors in LiDAR semantic labels, we design a dedicated refinement scheme that corrects erroneous labels based on structural features and distribution patterns. The semantic information generated by our radar segmentation model is used in two downstream tasks, achieving significant performance improvements. In large‑scale radar‑based localization using OpenStreetMap, it reduces localization error by 20.55% over prior methods. For the odometry task, it improves translation accuracy by 16.4% compared to the second‑best method, securing first place in the radar odometry competition at the Radar in Robotics workshop of ICRA 2024, Japan.

Abstract:
Open‑vocabulary 3D segmentation enables exploration of 3D spaces using free‑form text descriptions. Existing methods for open‑vocabulary 3D instance segmentation primarily focus on identifying object‑level instances but struggle with finer‑grained scene entities such as object parts, or regions described by generic attributes. In this work, we introduce Search3D, an approach to construct hierarchical open‑vocabulary 3D scene representations, enabling 3D search at multiple levels of granularity: fine‑grained object parts, entire objects, or regions described by attributes like materials. Unlike prior methods, Search3D shifts towards a more flexible open‑vocabulary 3D search paradigm, moving beyond explicit object‑centric queries. For systematic evaluation, we further contribute a scene‑scale open‑vocabulary 3D part segmentation benchmark based on MultiScan, along with a set of open‑vocabulary fine‑grained part annotations on ScanNet++. Search3D outperforms baselines in scene‑scale open‑vocabulary 3D part segmentation, while maintaining strong performance in segmenting 3D objects and materials. Our project page is http://search3d‑segmentation.github.io.

Abstract:
This paper addresses food crystal quality control in manufacturing, focusing on efficiently predicting food crystal counts and size distributions. Previously, manufacturers used the manual counting method on microscopic images of food liquid products, which requires substantial human effort and suffers from inconsistency issues. Food crystal segmentation is a challenging problem due to the diverse shapes of crystals and their surrounding hard mimics. To address this challenge, we propose an efficient instance segmentation method based on object detection. Experimental results show that the predicted crystal counting accuracy of our method is comparable with existing segmentation methods, while being five times faster. Based on our experiments, we also define objective criteria for separating hard mimics and food crystals, which could benefit manual annotation tasks on similar datasets.

Abstract:
End‑to‑end autonomous driving offers a streamlined alternative to the traditional modular pipeline, integrating perception, prediction, and planning within a single framework. While Deep Reinforcement Learning (DRL) has recently gained traction in this domain, existing approaches often overlook the critical connection between feature extraction of DRL and perception. In this paper, we bridge this gap by mapping the DRL feature extraction network directly to the perception phase, enabling clearer interpretation through semantic segmentation. By leveraging Bird's‑Eye‑View (BEV) representations, we propose a novel DRL‑based end‑to‑end driving framework that utilizes multi‑sensor inputs to construct a unified three‑dimensional understanding of the environment. This BEV‑based system extracts and translates critical environmental features into high‑level abstract states for DRL, facilitating more informed control. Extensive experimental evaluations demonstrate that our approach not only enhances interpretability but also significantly outperforms state‑of‑the‑art methods in autonomous driving control tasks, reducing the collision rate by 20%.

Abstract:
The emergence of visual language models, such as the segment anything model (SAM), has led to great breakthroughs in the field of universal semantic segmentation and has significantly aided improvements in medical image segmentation, in particular with the help of the Medical SAM adaptor (Med‑SA). However, Med‑SA can still be improved, as it fine‑tunes SAM in a partial adaptation manner. To resolve this problem, we present a novel global medical SAM adaptor (GMed‑SA) with full adaptation, which can adapt SAM globally. We further combine GMed‑SA and Med‑SA to propose a global‑local medical SAM adaptor (GLMed‑SA) to adapt SAM both globally and locally. Extensive experiments have been performed on a challenging public 2D melanoma segmentation dataset. The results show that GLMed‑SA outperforms several state‑of‑the‑art semantic segmentation methods on various evaluation metrics, demonstrating the superiority of our methods.

Abstract:
Semantic segmentation networks have achieved significant success under the assumption of independent and identically distributed data. However, these networks often struggle to detect anomalies from unknown semantic classes due to the limited set of visual concepts they are typically trained on. To address this issue, anomaly segmentation often involves fine‑tuning on outlier samples, necessitating additional efforts for data collection, labeling, and model retraining. Seeking to avoid this cumbersome work, we take a different approach and propose to incorporate Vision‑Language (VL) encoders into existing anomaly detectors to leverage the semantically broad VL pre‑training for improved outlier awareness. Additionally, we propose a new scoring function that enables data‑ and training‑free outlier supervision via textual prompts. The resulting VL4AD model, which includes max‑logit prompt ensembling and a class‑merging strategy, achieves competitive performance on widely used benchmark datasets, thereby demonstrating the potential of vision‑language models for pixel‑wise anomaly detection.
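
One plausible reading of max‑logit scoring with prompt ensembling is sketched below for a single pixel embedding. Everything specific here is an assumption: cosine similarity as the logit, taking the maximum over several prompts per class, and using the negated best class score as the anomaly score. The abstract does not give the actual scoring function or the class‑merging rule.

```python
import math


def normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def anomaly_score(pixel_embedding, class_prompts):
    """Return (anomaly score, per-class scores) for one pixel.

    class_prompts: dict mapping a class name to a list of text embeddings,
    one per prompt variant of that class (hypothetical structure).
    """
    pixel = normalize(pixel_embedding)
    class_scores = {
        name: max(dot(pixel, normalize(p)) for p in prompts)  # prompt ensemble
        for name, prompts in class_prompts.items()
    }
    # Max-logit: a pixel that matches no known class well scores as anomalous.
    return -max(class_scores.values()), class_scores
```

Since the known classes are described only by text embeddings, such a score needs no outlier images and no retraining, which matches the data‑ and training‑free setting described above.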

Abstract:
We introduce Go‑SLAM, a novel framework that utilizes 3D Gaussian Splatting SLAM to reconstruct dynamic environments while embedding object‑level information within the scene representations. This framework employs advanced object segmentation techniques, assigning a unique identifier to each Gaussian splat that corresponds to the object it represents. Consequently, our system facilitates open‑vocabulary querying, allowing users to locate objects using natural language descriptions. Furthermore, the framework features an optimal path generation module that calculates efficient navigation paths for robots toward queried objects, considering obstacles and environmental uncertainties. Comprehensive evaluations in various scene settings demonstrate the effectiveness of our approach in delivering high‑fidelity scene reconstructions, precise object segmentation, flexible object querying, and efficient robot path planning. This work represents an additional step forward in bridging the gap between 3D scene reconstruction, semantic object understanding, and real‑time environment interactions.

Abstract:
Segmentation is a crucial step in microscopy image analysis. Numerous approaches have been developed over the past years, ranging from classical segmentation algorithms to advanced deep learning models. While U‑Net remains one of the most popular and well‑established models for biomedical segmentation tasks, recently developed transformer‑based models promise to enhance the segmentation process of microscopy images. In this work, we assess the efficacy of transformers, including UNETR, the Segment Anything Model, and Swin‑UPerNet, and compare them with the well‑established U‑Net model across various image modalities such as electron microscopy, brightfield, histopathology, and phase‑contrast. Our evaluation identifies several limitations in the original Swin Transformer model, which we address through architectural modifications to optimise its performance. The results demonstrate that these modifications improve segmentation performance compared to the classical U‑Net model and the unmodified Swin‑UPerNet. This comparative analysis highlights the promise of transformer models for advancing biomedical image segmentation. It demonstrates that their efficiency and applicability can be improved with careful modifications, facilitating their future use in microscopy image analysis tools.

Abstract:
While deep learning has catalyzed breakthroughs across numerous domains, its broader adoption in clinical settings is inhibited by the costly and time‑intensive nature of data acquisition and annotation. To further facilitate medical machine learning, we present an ultrasound dataset of 10,223 Brightness‑mode (B‑mode) images consisting of sagittal slices of porcine spinal cords (N=25) before and after a contusion injury. We additionally benchmark the performance metrics of several state‑of‑the‑art object detection algorithms to localize the site of injury and semantic segmentation models to label the anatomy for comparison and creation of task‑specific architectures. Finally, we evaluate the zero‑shot generalization capabilities of the segmentation models on human ultrasound spinal cord images to determine whether training on our porcine dataset is sufficient for accurately interpreting human data. Our results show that the YOLOv8 detection model outperforms all evaluated models for injury localization, achieving a mean Average Precision (mAP50‑95) score of 0.606. Segmentation metrics indicate that the DeepLabv3 segmentation model achieves the highest accuracy on unseen porcine anatomy, with a Mean Dice score of 0.587, while SAMed achieves the highest Mean Dice score generalizing to human anatomy (0.445). To the best of our knowledge, this is the largest annotated dataset of spinal cord ultrasound images made publicly available to researchers and medical professionals, as well as the first public report of object detection and segmentation architectures to assess anatomical markers in the spinal cord for methodology development and clinical applications.
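
For readers unfamiliar with the Mean Dice metric reported above, a minimal sketch of the standard definition follows. The helper names are hypothetical, and returning 1.0 when both masks are empty is an assumed convention; the abstract does not state how the authors handle that case or how they average across classes.

```python
def dice_score(pred, target):
    """Dice = 2 * |P and T| / (|P| + |T|) for flat binary masks (lists of 0/1)."""
    inter = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * inter / total


def mean_dice(pred_labels, target_labels, classes):
    """Average the per-class Dice over the given class ids."""
    scores = []
    for c in classes:
        p = [1 if x == c else 0 for x in pred_labels]
        t = [1 if x == c else 0 for x in target_labels]
        scores.append(dice_score(p, t))
    return sum(scores) / len(scores)
```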

Abstract:
The National Bridge Inspection Standards require detailed element‑level bridge inspections. Traditionally, inspectors manually assign condition ratings by rating structural components based on damage, but this process is labor‑intensive and time‑consuming. Automating the element‑level bridge inspection process can facilitate more comprehensive condition documentation to improve overall bridge management. While semantic segmentation of bridge point clouds has been studied, research on instance segmentation of bridge elements is limited, partly due to the lack of annotated datasets and the difficulty of generalizing trained models. To address this, we propose a novel approach for generating synthetic data using three distinct methods. Our framework leverages the Mask3D transformer model, optimized with hyperparameter tuning and a novel occlusion technique. The model achieves state‑of‑the‑art performance on both real LiDAR and photogrammetry bridge point clouds, demonstrating the potential of the framework for automating element‑level bridge inspections.

Abstract:
Crop field boundaries are foundational datasets for agricultural monitoring and assessments but are expensive to collect manually. Machine learning (ML) methods for automatically extracting field boundaries from remotely sensed images could help realize the demand for these datasets at a global scale. However, current ML methods for field instance segmentation lack sufficient geographic coverage, accuracy, and generalization capabilities. Further, research on improving ML methods is restricted by the lack of labeled datasets representing the diversity of global agricultural fields. We present Fields of The World (FTW) ‑‑ a novel ML benchmark dataset for agricultural field instance segmentation spanning 24 countries on four continents (Europe, Africa, Asia, and South America). FTW is an order of magnitude larger than previous datasets with 70,462 samples, each containing instance and semantic segmentation masks paired with multi‑date, multi‑spectral Sentinel‑2 satellite images. We provide results from baseline models for the new FTW benchmark, show that models trained on FTW have better zero‑shot and fine‑tuning performance in held‑out countries than models that aren't pre‑trained with diverse datasets, and show positive qualitative zero‑shot results of FTW models in a real‑world scenario ‑‑ running on Sentinel‑2 scenes over Ethiopia.

Abstract:
We study behavior change‑based visual risk object identification (Visual‑ROI), a critical framework designed to detect potential hazards for intelligent driving systems. Existing methods often show significant limitations in spatial accuracy and temporal consistency, stemming from an incomplete understanding of scene affordance. For example, these methods frequently misidentify vehicles that do not impact the ego vehicle as risk objects. Furthermore, existing behavior change‑based methods are inefficient because they implement causal inference in the perspective image space. We propose a new framework with a Bird's Eye View (BEV) representation to overcome the above challenges. Specifically, we utilize potential fields as scene affordance, involving repulsive forces derived from road infrastructure and traffic participants, along with attractive forces sourced from target destinations. In this work, we compute potential fields by assigning different energy levels according to the semantic labels obtained from BEV semantic segmentation. We conduct thorough experiments and ablation studies, comparing the proposed method with various state‑of‑the‑art algorithms on both synthetic and real‑world datasets. Our results show a notable increase in spatial and temporal consistency, with enhancements of 20.3% and 11.6% on the RiskBench dataset, respectively. Additionally, we can improve computational efficiency by 88%. We achieve improvements of 5.4% in spatial accuracy and 7.2% in temporal consistency on the nuScenes dataset.
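
The potential‑field idea can be sketched as follows, assuming a sparse BEV grid whose occupied cells carry semantic labels. The energy values, the distance decay, the linear attractive term, and all names are hypothetical choices for illustration; the abstract only states that energy levels are assigned according to semantic labels.

```python
import math

# Hypothetical repulsive energy per BEV semantic label (assumed values).
REPULSIVE_ENERGY = {"road": 0.0, "sidewalk": 2.0, "vehicle": 5.0, "pedestrian": 8.0}


def potential(cell, bev_labels, goal, k_att=0.5, radius=2.0):
    """Total potential at `cell`: attraction toward the goal plus nearby repulsion."""
    r, c = cell
    # Attractive term: grows linearly with distance to the target destination.
    u = k_att * math.hypot(r - goal[0], c - goal[1])
    # Repulsive term: labeled cells within `radius` add energy that decays with distance.
    for (rr, cc), label in bev_labels.items():
        d = math.hypot(r - rr, c - cc)
        if d <= radius:
            u += REPULSIVE_ENERGY.get(label, 0.0) / (1.0 + d)
    return u
```

In such a sketch, an object whose removal barely changes the potential along the ego path would be a weak risk candidate, which matches the intuition given in the abstract.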

Abstract:
Weakly supervised semantic segmentation (WSSS) approaches typically rely on class activation maps (CAMs) for initial seed generation, which often fail to capture global context due to limited supervision from image‑level labels. To address this issue, we introduce DALNet, a Dense Alignment Learning Network that leverages text embeddings to enhance the comprehensive understanding and precise localization of objects across different levels of granularity. Our key insight is to employ a dual‑level alignment strategy: (1) Global Implicit Alignment (GIA) to capture global semantics by maximizing the similarity between the class token and the corresponding text embeddings while minimizing the similarity with background embeddings, and (2) Local Explicit Alignment (LEA) to improve object localization by utilizing spatial information from patch tokens. Moreover, we propose a cross‑contrastive learning approach that aligns foreground features between image and text modalities while separating them from the background, encouraging activation in missing regions and suppressing distractions. Through extensive experiments on the PASCAL VOC and MS COCO datasets, we demonstrate that DALNet significantly outperforms state‑of‑the‑art WSSS methods. In particular, as a single‑stage method, our approach allows for a more efficient end‑to‑end process.
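
The global alignment objective described above can be illustrated with a toy loss, assuming cosine similarity as the similarity measure. The difference‑of‑cosines form and the function names are assumptions for illustration only; DALNet's actual loss is not given in the abstract.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def global_alignment_loss(class_token, text_emb, background_emb):
    """Lower when the class token is near its text embedding and far from background."""
    return cosine(class_token, background_emb) - cosine(class_token, text_emb)
```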

Abstract:
Natural environments pose significant challenges for autonomous robot navigation, particularly due to their unstructured and ever‑changing nature. Hiking trails, with their dynamic conditions influenced by weather, vegetation, and human traffic, represent one such challenge. This work introduces a novel approach to autonomous hiking trail navigation that balances trail adherence with the flexibility to adapt to off‑trail routes when necessary. The solution is a Traversability Analysis module that integrates semantic data from camera images with geometric information from LiDAR to create a comprehensive understanding of the surrounding terrain. A planner uses this traversability map to navigate safely, adhering to trails while allowing off‑trail movement when necessary to avoid on‑trail hazards or for safe off‑trail shortcuts. The method is evaluated through simulation to determine the balance between semantic and geometric information in traversability estimation. These simulations tested various weights to assess their impact on navigation performance across different trail scenarios. Weights were then validated through field tests at the West Virginia University Core Arboretum, demonstrating the method's effectiveness in a real‑world environment.
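
The balance between semantic and geometric information mentioned above can be pictured as a weighted blend of two per‑cell cost maps. This is a minimal sketch under the assumption of a convex combination with costs in [0, 1]; the function name and the weighting scheme are hypothetical, since the abstract does not give the actual fusion rule.

```python
def traversability(semantic_cost, geometric_cost, w_semantic=0.5):
    """Blend two 2D cost grids; `w_semantic` is the weight being tuned."""
    if not 0.0 <= w_semantic <= 1.0:
        raise ValueError("w_semantic must lie in [0, 1]")
    return [
        [w_semantic * s + (1.0 - w_semantic) * g for s, g in zip(srow, grow)]
        for srow, grow in zip(semantic_cost, geometric_cost)
    ]
```

Sweeping `w_semantic` from 0 to 1 in simulation would then correspond to testing "various weights" before fixing one for field trials.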

Abstract:
Unseen Object Instance Segmentation (UOIS) is crucial for autonomous robots operating in unstructured environments. Previous approaches require full supervision on large‑scale tabletop datasets for effective pretraining. In this paper, we propose UOIS‑SAM, a data‑efficient solution for the UOIS task that leverages SAM's high accuracy and strong generalization capabilities. UOIS‑SAM integrates two key components: (i) a Heatmap‑based Prompt Generator (HPG) to generate class‑agnostic point prompts with precise foreground prediction, and (ii) a Hierarchical Discrimination Network (HDNet) that adapts SAM's mask decoder, mitigating issues introduced by the SAM baseline, such as background confusion and over‑segmentation, especially in scenarios involving occlusion and texture‑rich objects. Extensive experimental results on OCID, OSD, and additional photometrically challenging datasets including PhoCAL and HouseCat6D, demonstrate that, even using only 10% of the training samples compared to previous methods, UOIS‑SAM achieves state‑of‑the‑art performance in unseen object segmentation, highlighting its effectiveness and robustness in various tabletop scenes.

Abstract:
This study explores human‑robot interaction (HRI) based on a mobile robot and YOLO to increase real‑time situation awareness and prevent accidents in the workplace. Using object segmentation, we propose an approach that is capable of analyzing these situations in real‑time and providing useful information to avoid critical working situations. In the industry, ensuring the safety of workers is paramount, and solutions based on robots and AI can provide a safer environment. To that end, we propose a methodology evaluated with two different YOLO versions (YOLOv8 and YOLOv5) alongside a LoCoBot robot for supervision and to perform the interaction with a user. We show that our proposed approach is capable of navigating a test scenario and issuing alerts via Text‑to‑Speech when dangerous situations are faced, such as when hardhats and safety vests are not detected. Based on the results gathered, we conclude that our system is capable of detecting and reporting risk situations such as helmet/no helmet and safety vest/no safety vest situations.

Abstract:
Scene Change Detection is a challenging task in computer vision and robotics that aims to identify differences between two images of the same scene captured at different times. Traditional change detection methods rely on training models that take these image pairs as input and estimate the changes, which requires large amounts of annotated data, a costly and time‑consuming process. To overcome this, we propose ZeroSCD, a zero‑shot scene change detection framework that eliminates the need for training. ZeroSCD leverages pre‑existing models for place recognition and semantic segmentation, utilizing their features and outputs to perform change detection. In this framework, features extracted from the place recognition model are used to estimate correspondences and detect changes between the two images. These are then combined with segmentation results from the semantic segmentation model to precisely delineate the boundaries of the detected changes. Extensive experiments on benchmark datasets demonstrate that ZeroSCD outperforms several state‑of‑the‑art methods in change detection accuracy, despite not being trained on any of the benchmark datasets, proving its effectiveness and adaptability across different scenarios.
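
One plausible way to combine a coarse change map with segmentation output, as described above, is to vote within each segment. The sketch below assumes per‑pixel change flags (for example, from thresholded feature distances) and per‑pixel segment ids; the voting rule, the `min_ratio` threshold, and the function name are hypothetical and not taken from ZeroSCD.

```python
def refine_changes(change_flags, segment_ids, min_ratio=0.5):
    """Mark a whole segment as changed when enough of its pixels are flagged."""
    totals, hits = {}, {}
    for flag, seg in zip(change_flags, segment_ids):
        totals[seg] = totals.get(seg, 0) + 1
        hits[seg] = hits.get(seg, 0) + (1 if flag else 0)
    changed = {s for s in totals if hits[s] / totals[s] >= min_ratio}
    # Snap the change mask to segment boundaries.
    return [1 if seg in changed else 0 for seg in segment_ids]
```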

Abstract:
Vision‑based perception and reasoning are essential for scene understanding in any autonomous system. RGB and depth images are commonly used to capture both the semantic and geometric features of the environment. Developing methods to reliably interpret this data is critical for real‑world applications, where noisy measurements are often unavoidable. In this work, we introduce a diffusion‑based framework to address the RGB‑D semantic segmentation problem. Additionally, we demonstrate that utilizing a Deformable Attention Transformer as the encoder to extract features from depth images effectively captures the characteristics of invalid regions in depth measurements. Our generative framework shows a greater capacity to model the underlying distribution of RGB‑D images, achieving robust performance in challenging scenarios with significantly less training time compared to discriminative methods. Experimental results indicate that our approach achieves state‑of‑the‑art performance on both the NYUv2 and SUN‑RGBD datasets overall, and especially on their most challenging images. Our project page will be available at https://diffusionmms.github.io/

Abstract:
We propose the unified BRAVO challenge to benchmark the reliability of semantic segmentation models under realistic perturbations and unknown out‑of‑distribution (OOD) scenarios. We define two categories of reliability: (1) semantic reliability, which reflects the model's accuracy and calibration when exposed to various perturbations; and (2) OOD reliability, which measures the model's ability to detect object classes that are unknown during training. The challenge attracted nearly 100 submissions from international teams representing notable research institutions. The results reveal interesting insights into the importance of large‑scale pre‑training and minimal architectural design in developing robust and reliable semantic segmentation models.

Abstract:
Memory‑based video object segmentation methods model multiple objects over long temporal‑spatial spans by establishing a memory bank, and they achieve remarkable performance. However, they struggle to overcome false matching and are prone to losing critical information, resulting in confusion among different objects. In this paper, we propose an effective approach that jointly improves the matching and decoding stages to alleviate the false matching issue. For the memory matching stage, we present a cost‑aware mechanism that suppresses slight errors for short‑term memory and a shunted cross‑scale matching for long‑term memory that establishes a wide‑field matching space for various object scales. For the readout decoding stage, we implement a compensatory mechanism that aims to recover essential information missed at the matching stage. Our approach achieves outstanding performance on several popular benchmarks (i.e., DAVIS 2016&2017 Val (92.4%&88.1%), and DAVIS 2017 Test (83.9%)), and achieves 84.8%&84.6% on YouTubeVOS 2018&2019 Val.

Abstract:
Accurately reconstructing dense and semantically annotated 3D meshes from monocular images remains a challenging task due to the lack of geometry guidance and imperfect view‑dependent 2D priors. Though we have witnessed recent advancements in implicit neural scene representations enabling precise 2D rendering simply from multi‑view images, there have been few works addressing 3D scene understanding with monocular priors alone. In this paper, we propose MOSE, a neural field semantic reconstruction approach to lift inferred image‑level noisy priors to 3D, producing accurate semantics and geometry in both 3D and 2D space. The key motivation for our method is to leverage generic class‑agnostic segment masks as guidance to promote local consistency of rendered semantics during training. With the help of semantics, we further apply a smoothness regularization to texture‑less regions for better geometric quality, thus achieving mutual benefits of geometry and semantics. Experiments on the ScanNet dataset show that our MOSE outperforms relevant baselines across all metrics on tasks of 3D semantic segmentation, 2D semantic segmentation and 3D surface reconstruction.

Abstract:
Semantic segmentation of large‑scale outdoor point clouds is of significant importance in environment perception and scene understanding. However, this task continues to present a significant research challenge, due to the inherent complexity of outdoor objects and their diverse distributions in real‑world environments. In this study, we propose the Multilateral Cascading Network (MCNet) designed to address this challenge. The model comprises two key components: a Multilateral Cascading Attention Enhancement (MCAE) module, which facilitates the learning of complex local features through multilateral cascading operations; and a Point Cross Stage Partial (P‑CSP) module, which fuses global and local features, thereby optimizing the integration of valuable feature information across multiple scales. Our proposed method demonstrates superior performance relative to state‑of‑the‑art approaches across two widely recognized benchmark datasets: Toronto3D and SensatUrban. Especially on the city‑scale SensatUrban dataset, our results surpassed the current best result by 2.1% in overall mIoU and yielded an improvement of 15.9% on average for small‑sample object categories comprising less than 2% of the total samples, in comparison to the baseline method.

Abstract:
To ease the difficulty of acquiring annotation labels for 3D data, a common approach is unsupervised and open‑vocabulary semantic segmentation, which leverages 2D CLIP semantic knowledge. In this paper, unlike previous research that ignores the "noise" introduced during feature projection from 2D to 3D, we propose a novel distillation learning framework named CUS3D. In our approach, an object‑level denoising projection module is designed to screen out the "noise" and ensure more accurate 3D features. Based on the obtained features, a multimodal distillation learning module is designed to align the 3D features with the CLIP semantic feature space under object‑centered constraints to achieve advanced unsupervised semantic segmentation. We conduct comprehensive experiments in both unsupervised and open‑vocabulary segmentation, and the results consistently showcase the superiority of our model in achieving advanced unsupervised segmentation results and its effectiveness in open‑vocabulary segmentation.

Abstract:
This work addresses how to efficiently classify challenging histopathology images, such as gigapixel whole‑slide images for cancer diagnostics with image‑level annotation. We use images with annotated tumor regions to identify a set of tumor patches and a set of benign patches in a cancerous slide. Due to the variable nature of the region of interest, the tumor‑positive regions may account for only an extreme minority of the pixels. This creates an important problem during patch‑level classification, where the majority of patches from an image labeled as 'cancerous' are actually tumor‑free. This problem is different from semantic segmentation, which associates a label with every pixel in an image, because after patch extraction we are only dealing with patch‑level labels. Most existing approaches address the data imbalance issue by mitigating the data shortage in minority classes in order to prevent the model from being dominated by the majority classes. These methods include data re‑sampling, loss re‑weighting, margin modification, and data augmentation. In this work, we mitigate the patch‑level class imbalance problem by taking a divide‑and‑conquer approach. First, we partition the data into sub‑groups, and define three separate classification sub‑problems based on these data partitions. Then, using an information‑theoretic cluster‑based sampling of deep image patch features, we sample discriminative patches from the sub‑groups. Using these sampled patches, we build corresponding deep models to solve the new classification sub‑problems. Finally, we integrate information learned from the respective models to make a final decision on the patches. Our results show that the proposed approach can perform competitively using a very low percentage of the available patches in a given whole‑slide image.

Abstract:
Class‑agnostic image segmentation is a crucial component in automating image editing workflows, especially in contexts where object selection traditionally involves interactive tools. Existing methods in the literature often adhere to top‑down formulations, following the paradigm of class‑based approaches, where object detection precedes per‑object segmentation. In this work, we present a novel bottom‑up formulation for addressing the class‑agnostic segmentation problem. We supervise our network directly on the projective sphere of its feature space, employing losses inspired by metric learning literature as well as losses defined in a novel segmentation‑space representation. The segmentation results are obtained through a straightforward mean‑shift clustering of the estimated features. Our bottom‑up formulation exhibits exceptional generalization capability, even when trained on datasets designed for class‑based segmentation. We further showcase the effectiveness of our generic approach by addressing the challenging task of cell and nucleus segmentation. We believe that our bottom‑up formulation will offer valuable insights into diverse segmentation challenges in the literature.

Abstract:
Simulation‑based testing is widely used to assess the reliability of Autonomous Driving Systems (ADS), but its effectiveness is limited by the operational design domain (ODD) conditions available in such simulators. To address this limitation, in this work, we explore the integration of generative artificial intelligence techniques with physics‑based simulators to enhance ADS system‑level testing. Our study evaluates the effectiveness and computational overhead of three generative strategies based on diffusion models, namely instruction‑editing, inpainting, and inpainting with refinement. Specifically, we assess these techniques' capabilities to produce augmented simulator‑generated images of driving scenarios representing new ODDs. We employ a novel automated detector for invalid inputs based on semantic segmentation to ensure semantic preservation and realism of the neural generated images. We then perform system‑level testing to evaluate the ADS's generalization ability to newly synthesized ODDs. Our findings show that diffusion models help increase the ODD coverage for system‑level testing of ADS. Our automated semantic validator achieved a percentage of false positives as low as 3%, retaining the correctness and quality of the generated images for testing. Our approach successfully identified new ADS system failures before real‑world testing.

Abstract:
Large Language Models (LLMs) have demonstrated impressive performance across various tasks. However, current training approaches combine standard cross‑entropy loss with extensive data, human feedback, or ad hoc methods to enhance performance. These solutions are often not scalable or feasible due to their associated costs, complexity, or resource requirements. This study investigates the use of established semantic segmentation loss functions in natural language generation to create a versatile, practical, and scalable solution for fine‑tuning different architectures. We evaluate their effectiveness in solving Math Word Problems and question answering across different models of varying sizes. For the analyzed tasks, we found that the traditional Cross‑Entropy loss represents a sub‑optimal choice, while models trained to minimize alternative (task‑dependent) losses, such as Focal or Lovász, achieve a mean improvement of +42% on exact match without requiring additional data or human feedback. These findings suggest a promising pathway for more efficient and accessible training processes.
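
To make the contrast with cross‑entropy concrete, the sketch below evaluates the standard focal loss for a single target token, given the probability the model assigns to the correct token. The default `gamma=2.0` and the per‑token framing are assumptions for illustration; the abstract does not report the hyperparameters or the exact way the losses are applied to token sequences.

```python
import math


def cross_entropy(p_true):
    """Cross-entropy for one token: -log of the probability of the correct token."""
    return -math.log(p_true)


def focal_loss(p_true, gamma=2.0):
    """Focal loss: (1 - p)^gamma * CE, down-weighting tokens already predicted well."""
    return (1.0 - p_true) ** gamma * cross_entropy(p_true)
```

With `gamma=0` the focal loss reduces to cross‑entropy; larger values shift the training signal toward tokens the model still gets wrong.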

Abstract:
With the development of 3D and 2D data acquisition techniques, it has become easy to obtain point clouds and images of scenes simultaneously, which further facilitates dual‑modal semantic segmentation. Most existing methods for simultaneously segmenting point clouds and images rely heavily on the quantity and quality of the labeled training data. However, massive point‑wise and pixel‑wise labeling procedures are time‑consuming and labor‑intensive. To address this issue, we propose a parallel dual‑stream network to handle the semi‑supervised dual‑modal semantic segmentation task, called PD‑Net, by jointly utilizing a small number of labeled point clouds, a large number of unlabeled point clouds, and unlabeled images. The proposed PD‑Net consists of two parallel streams (called original stream and pseudo‑label prediction stream). The pseudo‑label prediction stream predicts the pseudo labels of unlabeled point clouds and their corresponding images. Then, the unlabeled data is sent to the original stream for self‑training. Each stream contains two encoder‑decoder branches for 3D and 2D data respectively. In each stream, multiple dual‑modal fusion modules are explored for fusing the dual‑modal features. In addition, a pseudo‑label optimization module is explored to optimize the pseudo labels output by the pseudo‑label prediction stream. Experimental results on two public datasets demonstrate that the proposed PD‑Net not only outperforms the comparative semi‑supervised methods but also achieves competitive performances with some fully‑supervised methods in most cases.

Abstract:
In this study, we implemented a two‑stage deep learning‑based approach to segment lesions in PET/CT images for the AutoPET III challenge. The first stage utilized a DynUNet model for coarse segmentation, identifying broad regions of interest. The second stage refined this segmentation using an ensemble of SwinUNETR, SegResNet, and UNet models. Preprocessing involved resampling images to a common resolution and normalization, while data augmentation techniques such as affine transformations and intensity adjustments were applied to enhance model generalization. The dataset was split into 80% training and 20% validation, excluding healthy cases. This method leverages multi‑stage segmentation and model ensembling to achieve precise lesion segmentation, aiming to improve robustness and overall performance.
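
The two‑stage idea can be sketched as follows, assuming the coarse stage yields a binary region‑of‑interest mask and each second‑stage model yields per‑voxel lesion probabilities. Averaging the probabilities, gating by the coarse mask, and the 0.5 threshold are assumptions for illustration; the abstract does not state how the ensemble outputs are combined.

```python
def ensemble_refine(coarse_mask, model_probs, threshold=0.5):
    """Average stage-2 probabilities and keep voxels inside the coarse ROI."""
    n = len(model_probs)
    out = []
    for i, inside in enumerate(coarse_mask):
        mean_p = sum(probs[i] for probs in model_probs) / n
        out.append(1 if inside and mean_p >= threshold else 0)
    return out


# Hypothetical flattened volume of 3 voxels and 3 stage-2 models.
coarse = [1, 1, 0]
probs = [[0.9, 0.2, 0.9], [0.7, 0.4, 0.8], [0.8, 0.3, 0.9]]
```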

Abstract:
In Canada's northern regions, linear disturbances such as roads, seismic exploration lines, and pipelines pose a significant threat to the boreal woodland caribou population (Rangifer tarandus). To address the critical need for management of these disturbances, there is a strong emphasis on developing mapping approaches that accurately identify forest habitat fragmentation. The traditional approach is manually generating maps, which is time‑consuming and lacks the capability for frequent updates. Instead, applying deep learning methods to multispectral satellite imagery offers a cost‑effective solution for automated and regularly updated map production. Deep learning models have shown promise in extracting paved roads in urban environments when paired with high‑resolution (<0.5m) imagery, but their effectiveness for general linear feature extraction in forested areas from lower resolution imagery remains underexplored. This research employs a deep convolutional neural network model based on the VGGNet16 architecture for semantic segmentation of lower resolution (10m) Sentinel‑2 satellite imagery, creating precise multi‑class linear disturbance maps. The model is trained using ground‑truth label maps sourced from the freely available Alberta Institute of Biodiversity Monitoring Human Footprint dataset, specifically targeting the Boreal and Taiga Plains ecozones in Alberta, Canada. Despite challenges in segmenting lower resolution imagery, particularly for thin linear disturbances like seismic exploration lines that can exhibit a width of 1‑3 pixels in Sentinel‑2 imagery, our results demonstrate the effectiveness of the VGGNet model for accurate linear disturbance retrieval. By leveraging the freely available Sentinel‑2 imagery, this work advances cost‑effective automated mapping techniques for identifying and monitoring linear disturbance fragmentation.

Abstract:
The process of fish cage inspections, which is a necessary maintenance task at any fish farm, be it small scale or industrial, is a task that has the potential to be fully automated. Replacing trained divers who perform regular inspections with autonomous marine vehicles would lower the costs of manpower and remove the risks associated with humans performing underwater inspections. Achieving such a level of autonomy implies developing an image processing algorithm that is capable of estimating the state of biofouling buildup. The aim of this work is to propose a complete solution for automating the said inspection process; from developing an autonomous control algorithm for an ROV, to automatically segmenting images of fish cages, and accurately estimating the state of biofouling. The first part is achieved by modifying a commercially available ROV with an acoustic SBL positioning system and developing a closed‑loop control system. The second part is realized by implementing a proposed biofouling estimation framework, which relies on AI to perform image segmentation, and by processing images using established computer vision methods to obtain a rough estimate of the distance of the ROV from the fish cage. This also involved developing a labeling tool in order to create a dataset of images for the neural network performing the semantic segmentation to be trained on. The experimental results show the viability of using an ROV fitted with an acoustic transponder for autonomous missions, and demonstrate the biofouling estimation framework's ability to provide accurate assessments, alongside satisfactory distance estimation capabilities. In conclusion, the achieved biofouling estimation accuracy showcases clear potential for use in the aquaculture industry.

Abstract:
In fine‑grained road scene understanding, semantic segmentation plays a crucial role in enabling vehicles to perceive and comprehend their surroundings. By assigning a specific class label to each pixel in an image, it allows for precise identification and localization of detailed road features, which is vital for high‑quality scene understanding and downstream perception tasks. A key challenge in this domain lies in improving the recognition performance of minority classes while mitigating the dominance of majority classes, which is essential for achieving balanced and robust overall performance. However, traditional semi‑supervised learning methods often train models while overlooking the imbalance between classes. To address this issue, we first propose a general training module that learns from all the pseudo‑labels without a conventional filtering strategy. Second, we propose a professional training module that learns specifically from reliable minority‑class pseudo‑labels identified by a novel mismatch score metric. The two modules supervise each other, which reduces model coupling, an essential property for semi‑supervised learning. During contrastive learning, to avoid the dominance of the majority classes in the feature space, we propose a strategy that assigns evenly distributed anchors to different classes in the feature space. Experimental results on multiple public benchmarks show that our method surpasses traditional approaches in recognizing tail classes.

Abstract:
Grasp detection requires flexibility to handle objects of various shapes without relying on prior knowledge of the object, while also offering intuitive, user‑guided control. This paper introduces GraspSAM, an innovative extension of the Segment Anything Model (SAM), designed for prompt‑driven and category‑agnostic grasp detection. Unlike previous methods, which are often limited by small‑scale training data, GraspSAM leverages the large‑scale training and prompt‑based segmentation capabilities of SAM to efficiently support both target‑object and category‑agnostic grasping. By utilizing adapters, learnable token embeddings, and a lightweight modified decoder, GraspSAM requires minimal fine‑tuning to integrate object segmentation and grasp prediction into a unified framework. The model achieves state‑of‑the‑art (SOTA) performance across multiple datasets, including Jacquard, Grasp‑Anything, and Grasp‑Anything++. Extensive experiments demonstrate the flexibility of GraspSAM in handling different types of prompts (such as points, boxes, and language), highlighting its robustness and effectiveness in real‑world robotic applications.

Abstract:
Extracting hepatic vessels from abdominal images is of high interest to clinicians since it allows the liver to be divided into functionally‑independent Couinaud segments. In this respect, automated liver blood vessel extraction is in high demand. Despite the significant growth in performance of semantic segmentation methodologies, preserving the complex multi‑scale geometry of main vessels and ramifications remains a major challenge. This paper provides a new deep supervised approach for vessel segmentation, with a strong focus on representations arising from the different scales inherent to the vascular tree geometry. In particular, we propose a new clustering technique to decompose the tree into various scale levels, from tiny to large vessels. Then, we extend the standard 3D UNet to multi‑task learning by incorporating scale‑specific auxiliary tasks and contrastive learning to encourage discrimination between scales in the shared representation. Promising results across several evaluation metrics are reported on the public 3D‑IRCADb dataset.

Abstract:
With the ever‑growing complexity of models in the field of remote sensing (RS), there is an increasing demand for solutions that balance model accuracy with computational efficiency. Knowledge distillation (KD) has emerged as a powerful tool to meet this need, enabling the transfer of knowledge from large, complex models to smaller, more efficient ones without significant loss in performance. This review article provides an extensive examination of KD and its innovative applications in RS. KD, a technique developed to transfer knowledge from a complex, often cumbersome model (teacher) to a more compact and efficient model (student), has seen significant evolution and application across various domains. Initially, we introduce the fundamental concepts and historical progression of KD methods. The advantages of employing KD are highlighted, particularly in terms of model compression, enhanced computational efficiency, and improved performance, which are pivotal for practical deployments in RS scenarios. The article provides a comprehensive taxonomy of KD techniques, where each category is critically analyzed to demonstrate the breadth and depth of the alternative options, and illustrates specific case studies that showcase the practical implementation of KD methods in RS tasks, such as instance segmentation and object detection. Further, the review discusses the challenges and limitations of KD in RS, including practical constraints and prospective future directions, providing a comprehensive overview for researchers and practitioners in the field of RS. Through this organization, the paper not only elucidates the current state of research in KD but also sets the stage for future research opportunities, thereby contributing significantly to both academic research and real‑world applications.

Abstract:
Representing the 3D environment with instance‑aware semantic and geometric information is crucial for interaction‑aware robots in dynamic environments. Nevertheless, creating such a representation poses challenges due to sensor noise, instance segmentation and tracking errors, and the objects' dynamic motion. This paper introduces a novel particle‑based instance‑aware semantic occupancy map to tackle these challenges. Particles with an augmented instance state are used to estimate the Probability Hypothesis Density (PHD) of the objects and implicitly model the environment. Utilizing a State‑augmented Sequential Monte Carlo PHD (S^2MC‑PHD) filter, these particles are updated to jointly estimate occupancy status, semantics, and instance IDs, mitigating noise. Additionally, a memory module is adopted to enhance the map's responsiveness to previously observed objects. Experimental results on the Virtual KITTI 2 dataset demonstrate that the proposed approach surpasses state‑of‑the‑art methods across multiple metrics under different noise conditions. Subsequent tests using real‑world data further validate the effectiveness of the proposed approach.

Abstract:
Learning with limited labelled data is a challenging problem in various applications, including remote sensing. Few‑shot semantic segmentation is one approach that can encourage deep learning models to learn from few labelled examples for novel classes not seen during training. The generalized few‑shot segmentation setting adds a further challenge: models must not only adapt to the novel classes but also maintain strong performance on the training base classes. While previous datasets and benchmarks discussed the few‑shot segmentation setting in remote sensing, we are the first to propose a generalized few‑shot segmentation benchmark for remote sensing. The generalized setting is more realistic and challenging, which necessitates exploring it within the remote sensing context. We release the dataset augmenting OpenEarthMap with additional classes labelled for the generalized few‑shot evaluation setting. The dataset was released as part of the OpenEarthMap land cover mapping generalized few‑shot challenge in the L3D‑IVU workshop in conjunction with CVPR 2024. In this work, we summarize the dataset and challenge details in addition to providing the benchmark results on the two phases of the challenge for the validation and test sets.

Abstract:
Image classification models, including convolutional neural networks (CNNs), perform well on a variety of classification tasks but struggle under conditions of partial occlusion, i.e., conditions in which objects are partially covered from the view of a camera. Methods to improve performance under occlusion, including data augmentation, part‑based clustering, and more inherently robust architectures, including Vision Transformer (ViT) models, have, to some extent, been evaluated on their ability to classify objects under partial occlusion. However, evaluations of these methods have largely relied on images containing artificial occlusion, which are typically computer‑generated and therefore inexpensive to label. Additionally, methods are rarely compared against each other, and many methods are compared against early, now outdated, deep learning models. We contribute the Image Recognition Under Occlusion (IRUO) dataset, based on the recently developed Occluded Video Instance Segmentation (OVIS) dataset (arXiv:2102.01558). IRUO utilizes real‑world and artificially occluded images to test and benchmark leading methods' robustness to partial occlusion in visual recognition tasks. In addition, we contribute the design and results of a human study using images from IRUO that evaluates human classification performance at multiple levels and types of occlusion. We find that modern CNN‑based models show improved recognition accuracy on occluded images compared to earlier CNN‑based models, and ViT‑based models are more accurate than CNN‑based models on occluded images, performing only modestly worse than human accuracy. We also find that certain types of occlusion, including diffuse occlusion, where relevant objects are seen through "holes" in occluders such as fences and leaves, can greatly reduce the accuracy of deep recognition models as compared to humans, especially those with CNN backbones.

Abstract:
Compared to deep neural networks trained for specific tasks, foundational deep networks trained on generic datasets, such as ImageNet classification, benefit from larger‑scale datasets, simpler network structures, and easier training techniques. In this paper, we design a prompting module which performs few‑shot adaptation of generic deep networks to new tasks. Driven by learning theory, we derive prompting modules that are as simple as possible, as they generalize better under the same training error. We use video object segmentation as a case study. We give a concrete prompting module, the Semi‑parametric Deep Forest (SDForest), which combines several nonparametric methods, such as correlation filters, random forests, and image‑guided filters, with a deep network trained for the ImageNet classification task. From a learning‑theoretical point of view, all these models have significantly smaller VC dimension or complexity and so tend to generalize better, as long as empirical studies show that the training error of this simple ensemble is comparable to that of an end‑to‑end trained deep network. We also propose a novel method of analyzing generalization in the video object segmentation setting to make the bound tighter. In practice, SDForest has extremely low computation cost and runs in real time even on a CPU. We test on video object segmentation tasks and achieve performance competitive with purely deep learning approaches on DAVIS2016 and DAVIS2017, without any training or fine‑tuning.

Abstract:
We present a novel frequency‑based Self‑Supervised Learning (SSL) approach that is significantly more effective for pre‑training. Prior work in this direction masks out pre‑defined frequencies in the input image and employs a reconstruction loss to pre‑train the model. While achieving promising results, such an implementation has two fundamental limitations, as identified in our paper. First, using pre‑defined frequencies overlooks the variability of image frequency responses. Second, having been pre‑trained with frequency‑filtered images, the resulting model needs relatively more data to adapt to natural‑looking images during fine‑tuning. To address these drawbacks, we propose FOurier transform compression with seLf‑Knowledge distillation (FOLK), integrating two dedicated ideas. First, inspired by image compression, we adaptively select the masked‑out frequencies based on image frequency responses, creating more suitable SSL tasks for pre‑training. Second, we employ a two‑branch framework empowered by knowledge distillation, enabling the model to take both the filtered and original images as input, largely reducing the burden of downstream tasks. Our experimental results demonstrate that FOLK achieves performance competitive with many state‑of‑the‑art SSL methods across various downstream tasks, including image classification, few‑shot learning, and semantic segmentation.

Abstract:
Large‑scale semantic segmentation networks often achieve high performance, but applying them can be challenging when sample sizes and computational resources are limited. In scenarios with restricted network size and computational complexity, models encounter significant challenges in capturing long‑range dependencies and recovering detailed information in images. We propose a lightweight bilateral semantic segmentation network called bilateral attention fusion network (BAFNet) to efficiently segment high‑resolution urban remote sensing images. The model consists of two paths, namely a dependency path and a remote‑local path. The dependency path utilizes large kernel attention to acquire long‑range dependencies in the image. In addition, multi‑scale local attention and efficient remote attention are designed to construct the remote‑local path. Finally, a feature aggregation module is designed to effectively utilize the different features of the two paths. Our proposed method was tested on the public high‑resolution urban remote sensing datasets Vaihingen and Potsdam, with mIoU reaching 83.20% and 86.53%, respectively. As a lightweight semantic segmentation model, BAFNet not only outperforms advanced lightweight models in accuracy but also demonstrates comparable performance to non‑lightweight state‑of‑the‑art methods on two datasets, despite a tenfold difference in floating‑point operations and a fifteenfold difference in network parameters.

Abstract:
This article presents a complete semantic scene understanding workflow using only a single 2D lidar. This fills the gap in 2D lidar semantic segmentation, thereby enabling the rethinking and enhancement of existing 2D lidar‑based algorithms for application in various mobile robot tasks. It introduces the first publicly available 2D lidar semantic segmentation dataset and the first fine‑grained semantic segmentation algorithm specifically designed for 2D lidar sensors on autonomous mobile robots. To annotate this dataset, we propose a novel semi‑automatic semantic labeling framework that requires minimal human effort and provides point‑level semantic annotations. The data was collected by three different types of 2D lidar sensors across twelve indoor environments, featuring a range of common indoor objects. Furthermore, the proposed semantic segmentation algorithm fully exploits raw lidar information ‑‑ position, range, intensity, and incident angle ‑‑ to deliver stochastic, point‑wise semantic segmentation. We present a series of semantic occupancy grid mapping experiments and demonstrate two semantically‑aware navigation control policies based on 2D lidar. These results demonstrate that the proposed semantic 2D lidar dataset, semi‑automatic labeling framework, and segmentation algorithm are effective and can enhance different components of the robotic navigation pipeline. Multimedia resources are available at: https://youtu.be/P1Hsvj6WUSY.

Abstract:
Leveraging multiple training datasets to scale up image segmentation models is beneficial for increasing robustness and semantic understanding. Individual datasets have well‑defined ground truth with non‑overlapping mask layouts and mutually exclusive semantics. However, merging them for multi‑dataset training disrupts this harmony and leads to semantic inconsistencies; for example, the class "person" in one dataset and class "face" in another will require multilabel handling for certain pixels. Existing methods struggle with this setting, particularly when evaluated on label spaces mixed from the individual training sets. To overcome these issues, we introduce a simple yet effective multi‑dataset training approach by integrating language‑based embeddings of class names and label space‑specific query embeddings. Our method maintains high performance regardless of the underlying inconsistencies between training datasets. Notably, on four benchmark datasets with label space inconsistencies during inference, we outperform previous methods by 1.6% mIoU for semantic segmentation, 9.1% PQ for panoptic segmentation, 12.1% AP for instance segmentation, and 3.0% in the newly proposed PIQ metric.

Abstract:
Along with the rapid growth of autonomous vehicles (AVs), demands on environment perception technology keep increasing. Among other technologies, HD mapping has taken on one of the more prominent roles in helping the vehicle carry out essential tasks such as localization and path planning. While increasing research efforts have been directed toward HD map development, a comprehensive overview of the overall HD map mapping and update framework is still lacking. This article introduces the development and current state of the algorithms involved in HD map creation and maintenance. As part of this study, the primary approaches to data preprocessing (turning raw data into information ready for mapping and updating), semantic segmentation, and localization are also briefly reviewed. Moreover, the map taxonomy, ontology, and quality assessment are extensively discussed, the map data's general representation method is presented, and mapping algorithms, ranging from SLAM to transformer‑based learning approaches, are also discussed. The development of HD map update algorithms, from change detection to the update methods, is also presented. Finally, the authors discuss possible future developments and the remaining challenges in HD map mapping and update technology. This paper simultaneously serves as a position paper and a tutorial for those new to the HD map mapping and update domains.

Abstract:
Prototypical part learning is emerging as a promising approach for making semantic segmentation interpretable. The model selects real patches seen during training as prototypes and constructs the dense prediction map based on the similarity between parts of the test image and the prototypes. This improves interpretability since the user can inspect the link between the predicted output and the patterns learned by the model in terms of prototypical information. In this paper, we propose a method for interpretable semantic segmentation that leverages multi‑scale image representation for prototypical part learning. First, we introduce a prototype layer that explicitly learns diverse prototypical parts at several scales, leading to multi‑scale representations in the prototype activation output. Then, we propose a sparse grouping mechanism that produces multi‑scale sparse groups of these scale‑specific prototypical parts. This provides a deeper understanding of the interactions between multi‑scale object representations while enhancing the interpretability of the segmentation model. The experiments conducted on Pascal VOC, Cityscapes, and ADE20K demonstrate that the proposed method increases model sparsity, improves interpretability over existing prototype‑based methods, and narrows the performance gap with the non‑interpretable counterpart models. Code is available at github.com/eceo‑epfl/ScaleProtoSeg.

Abstract:
Surgical instrument segmentation is instrumental to minimally invasive surgeries and related applications. Most previous methods formulate this task as single‑frame‑based instance segmentation while ignoring the natural temporal and stereo attributes of a surgical video. As a result, these methods are less robust against appearance variation caused by temporal motion and view change. In this work, we propose a novel LACOSTE model that exploits Location‑Agnostic COntexts in Stereo and TEmporal images for improved surgical instrument segmentation. Leveraging a query‑based segmentation model as the core, we design three performance‑enhancing modules. Firstly, we design a disparity‑guided feature propagation module to enhance depth‑aware features explicitly. To generalize well even to monocular video, we apply a pseudo stereo scheme to generate complementary right images. Secondly, we propose a stereo‑temporal set classifier, which aggregates stereo‑temporal contexts in a universal way to make a consolidated prediction and mitigate transient failures. Finally, we propose a location‑agnostic classifier to decouple the location bias from mask prediction and enhance the feature semantics. We extensively validate our approach on three public surgical video datasets, including two benchmarks from EndoVis Challenges and one real radical prostatectomy surgery dataset, GraSP. Experimental results demonstrate the promising performance of our method, which consistently achieves results comparable or favorable to previous state‑of‑the‑art approaches.

Abstract:
CAD models are widely used in industry and are essential for robotic automation processes. However, these models are rarely considered in novel AI‑based approaches, such as the automatic synthesis of robot programs, as there are no readily available methods that would allow CAD models to be incorporated for the analysis, interpretation, or extraction of information. To address these limitations, we propose QueryCAD, the first system designed for CAD question answering, enabling the extraction of precise information from CAD models using natural language queries. QueryCAD incorporates SegCAD, an open‑vocabulary instance segmentation model we developed to identify and select specific parts of the CAD model based on part descriptions. We further propose a CAD question answering benchmark to evaluate QueryCAD and establish a foundation for future research. Lastly, we integrate QueryCAD within an automatic robot program synthesis framework, validating its ability to enhance deep‑learning solutions for robotics by enabling them to process CAD models (https://claudius‑kienle.github.com/querycad).

Abstract:
Class Incremental Semantic Segmentation (CISS) aims to mitigate catastrophic forgetting by maintaining a balance between previously learned and newly introduced knowledge. Existing methods, primarily based on regularization techniques like knowledge distillation, help preserve old knowledge but often face challenges in effectively integrating new knowledge, resulting in limited overall improvement. The Endpoints Weight Fusion (EWF) method, while simple, effectively addresses some of these limitations by dynamically fusing the model weights from previous steps with those from the current step, using a fusion parameter alpha determined by the relative number of previously known classes and newly introduced classes. However, the simplicity of the alpha calculation may limit its ability to fully capture the complexities of different task scenarios, potentially leading to suboptimal fusion outcomes. In this paper, we propose an enhanced approach called Adaptive Weight Fusion (AWF), which introduces an alternating training strategy for the fusion parameter, allowing for more flexible and adaptive weight integration. AWF achieves superior performance by better balancing the retention of old knowledge with the learning of new classes, significantly improving results on benchmark CISS tasks compared to the original EWF. Our experiment code will be released on GitHub.
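
The weight fusion idea described above can be illustrated with a minimal sketch. It assumes a plain linear interpolation of parameters and, purely for illustration, sets alpha to the share of previously known classes. The abstract gives neither the exact EWF formula nor the AWF training schedule, so `ratio_alpha` and `fuse_weights` are hypothetical names rather than the authors' code.

```python
def ratio_alpha(num_old_classes, num_new_classes):
    """Illustrative alpha: share of previously known classes (an assumption)."""
    total = num_old_classes + num_new_classes
    if total <= 0:
        raise ValueError("need at least one class")
    return num_old_classes / total


def fuse_weights(old_weights, new_weights, alpha):
    """Interpolate two parameter dicts: alpha * old + (1 - alpha) * new."""
    if old_weights.keys() != new_weights.keys():
        raise ValueError("parameter names must match")
    return {
        name: [alpha * o + (1.0 - alpha) * n
               for o, n in zip(old_weights[name], new_weights[name])]
        for name in old_weights
    }


# Toy parameters standing in for the previous-step and current-step models.
old = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
new = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}

alpha = ratio_alpha(num_old_classes=15, num_new_classes=5)
fused = fuse_weights(old, new, alpha)
```

In this sketch a larger alpha keeps the fused model closer to the previous step. An adaptive variant would treat alpha as a quantity to be trained instead of computing it from class counts.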

Abstract:
We introduce VistaFormer, a lightweight Transformer‑based model architecture for the semantic segmentation of remote‑sensing images. This model uses a multi‑scale Transformer‑based encoder with a lightweight decoder that aggregates global and local attention captured in the encoder blocks. VistaFormer uses position‑free self‑attention layers, which simplifies the model architecture and removes the need to interpolate temporal and spatial codes, a step that can reduce model performance when training and testing image resolutions differ. We investigate simple techniques for filtering noisy input signals like clouds and demonstrate that improved model scalability can be achieved by substituting Multi‑Head Self‑Attention (MHSA) with Neighbourhood Attention (NA). Experiments on the PASTIS and MTLCC crop‑type segmentation benchmarks show that VistaFormer achieves better performance than comparable models and requires only 8% of the floating point operations using MHSA and 11% using NA, while also using fewer trainable parameters. VistaFormer with MHSA improves on state‑of‑the‑art mIoU scores by 0.1% on the PASTIS benchmark and 3% on the MTLCC benchmark, while VistaFormer with NA improves on the MTLCC benchmark by 3.7%.

Abstract:
3D segmentation is a core problem in computer vision and, similarly to many other dense prediction tasks, it requires large amounts of annotated data for adequate training. However, densely labeling 3D point clouds to employ fully‑supervised training remains too labor intensive and expensive. Semi‑supervised training provides a more practical alternative, where only a small set of labeled data is given, accompanied by a larger unlabeled set. This area thus studies the effective use of unlabeled data to reduce the performance gap that arises due to the lack of annotations. In this work, inspired by Bayesian deep learning, we first propose a Bayesian self‑training framework for semi‑supervised 3D semantic segmentation. Employing stochastic inference, we generate an initial set of pseudo‑labels and then filter these based on estimated point‑wise uncertainty. By constructing a heuristic n‑partite matching algorithm, we extend the method to semi‑supervised 3D instance segmentation, and finally, with the same building blocks, to dense 3D visual grounding. We demonstrate state‑of‑the‑art results for our semi‑supervised method on SemanticKITTI and ScribbleKITTI for 3D semantic segmentation and on ScanNet and S3DIS for 3D instance segmentation. We further achieve substantial improvements in dense 3D visual grounding over supervised‑only baselines on ScanRefer. Our project page is available at ouenal.github.io/bst/.
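
The pseudo-label filtering step described above can be sketched as follows. This is a minimal illustration: it assumes predictive entropy over averaged stochastic forward passes as the point-wise uncertainty measure and a fixed threshold, neither of which the abstract specifies. The function names and toy numbers are hypothetical.

```python
import math


def mean_probs(samples):
    """Average class probabilities over the stochastic passes for one point."""
    num_classes = len(samples[0])
    return [sum(s[c] for s in samples) / len(samples) for c in range(num_classes)]


def entropy(probs):
    """Predictive entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def filter_pseudo_labels(point_samples, max_entropy):
    """Return {point_index: label} for points with entropy <= max_entropy."""
    kept = {}
    for idx, samples in enumerate(point_samples):
        probs = mean_probs(samples)
        if entropy(probs) <= max_entropy:
            kept[idx] = max(range(len(probs)), key=probs.__getitem__)
    return kept


# Two points, three stochastic passes each, two classes (toy numbers).
point_samples = [
    [[0.9, 0.1], [0.95, 0.05], [0.85, 0.15]],  # passes agree
    [[0.6, 0.4], [0.4, 0.6], [0.5, 0.5]],      # passes disagree
]
pseudo_labels = filter_pseudo_labels(point_samples, max_entropy=0.5)
```

Only the point whose passes agree keeps its pseudo-label. The uncertain point is left unlabeled and so would not contribute to a self-training loss.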

Abstract:
RGB‑D has gradually become a crucial data source for understanding complex scenes in assisted driving. However, existing studies have paid insufficient attention to the intrinsic spatial properties of depth maps. This oversight significantly impacts the attention representation, leading to prediction errors caused by attention shift issues. To this end, we propose a novel learnable Depth interaction Pyramid Transformer (DiPFormer) to explore the effectiveness of depth. Firstly, we introduce Depth Spatial‑Aware Optimization (Depth SAO) as an offset to represent real‑world spatial relationships. Secondly, the similarity in the feature space of RGB‑D is learned by Depth Linear Cross‑Attention (Depth LCA) to clarify spatial differences at the pixel level. Finally, an MLP Decoder is utilized to effectively fuse multi‑scale features while meeting real‑time requirements. Comprehensive experiments demonstrate that the proposed DiPFormer significantly addresses the issue of attention misalignment in both road detection (+7.5%) and semantic segmentation (+4.9% / +1.5%) tasks. DiPFormer achieves state‑of‑the‑art performance on the KITTI (97.57% F‑score on KITTI road and 68.74% mIoU on KITTI‑360) and Cityscapes (83.4% mIoU) datasets.

Abstract:
Online object segmentation and tracking in Lidar point clouds enables autonomous agents to understand their surroundings and make safe decisions. Unfortunately, manual annotations for these tasks are prohibitively costly. We tackle this problem with the task of class‑agnostic unsupervised online instance segmentation and tracking. To that end, we leverage an instance segmentation backbone and propose a new training recipe that enables the online tracking of objects. Our network is trained on pseudo‑labels, eliminating the need for manual annotations. We conduct an evaluation using metrics adapted for temporal instance segmentation. Computing these metrics requires temporally‑consistent instance labels. When unavailable, we construct these labels using the available 3D bounding boxes and semantic labels in the dataset. We compare our method against strong baselines and demonstrate its superiority across two different outdoor Lidar datasets.

Abstract:
Surgical scenes convey crucial information about the quality of surgery. Pixel‑wise localization of tools and anatomical structures is the first task towards deeper surgical analysis for microscopic or endoscopic surgical views. This is typically done via fully‑supervised methods, which are annotation‑hungry and, in several cases, demand medical expertise. Considering the profusion of surgical videos obtained through standardized surgical workflows, we propose an annotation‑efficient framework for the semantic segmentation of surgical scenes. We employ image‑based self‑supervised object discovery to identify the most salient tools and anatomical structures in surgical videos. These proposals are further refined within a minimally supervised fine‑tuning step. Our unsupervised setup, reinforced with only 36 annotation labels, achieves localization performance comparable to fully‑supervised segmentation models. Further, leveraging surgical phase labels as weak labels can better guide model attention towards surgical tools, leading to an approximately 2% improvement in tool localization. Extensive ablation studies on the CaDIS dataset validate the effectiveness of our proposed solution in discovering relevant surgical objects with minimal or no supervision.

Abstract:
We propose Vision Token Turing Machines (ViTTM), an efficient, low‑latency, memory‑augmented Vision Transformer (ViT). Our approach builds on Neural Turing Machines and Token Turing Machines, which were applied to NLP and sequential visual understanding tasks. ViTTMs are designed for non‑sequential computer vision tasks such as image classification and segmentation. Our model creates two sets of tokens: process tokens and memory tokens. Process tokens pass through encoder blocks and read from and write to memory tokens at each encoder block in the network, allowing them to store and retrieve information from memory. By ensuring that there are fewer process tokens than memory tokens, we are able to reduce the inference time of the network while maintaining its accuracy. On ImageNet‑1K, the state‑of‑the‑art ViT‑B has a median latency of 529.5 ms and 81.0% accuracy, while our ViTTM‑B is 56% faster (234.1 ms), with 2.4 times fewer FLOPs and an accuracy of 82.9%. On ADE20K semantic segmentation, ViT‑B achieves 45.65 mIoU at 13.8 frames per second (FPS), whereas our ViTTM‑B model achieves 45.17 mIoU at 26.8 FPS (+94%).
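
The relative speed figures quoted above can be reproduced from the raw numbers with a short calculation. This sketch assumes that "56% faster" refers to the relative reduction in median latency and that "+94%" refers to the relative gain in frames per second. The helper names are illustrative.

```python
def latency_reduction_pct(baseline_ms, new_ms):
    """Relative latency reduction with respect to the baseline, in percent."""
    return 100.0 * (baseline_ms - new_ms) / baseline_ms


def throughput_gain_pct(baseline_fps, new_fps):
    """Relative throughput gain with respect to the baseline, in percent."""
    return 100.0 * (new_fps - baseline_fps) / baseline_fps


# Figures taken from the abstract: ImageNet-1K latency and ADE20K FPS.
latency_cut = latency_reduction_pct(529.5, 234.1)
fps_gain = throughput_gain_pct(13.8, 26.8)

print(f"latency reduction: {latency_cut:.1f}%")
print(f"FPS gain: {fps_gain:.1f}%")
```

Under these assumptions both values round to the percentages reported in the abstract.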

Abstract:
Computed tomography (CT) reconstruction plays a crucial role in industrial nondestructive testing and medical diagnosis. Sparse view CT reconstruction aims to reconstruct high‑quality CT images while only using a small number of projections, which helps to improve the detection speed of industrial assembly lines and is also meaningful for reducing radiation in medical scenarios. Sparse CT reconstruction methods based on implicit neural representations (INRs) have recently shown promising performance, but still produce artifacts because of the difficulty of obtaining useful prior information. In this work, we incorporate a powerful prior: the total number of material categories of objects. To utilize the prior, we design AC‑IND, a self‑supervised method based on Attenuation Coefficient Estimation and Implicit Neural Distribution. Specifically, our method first transforms the traditional INR from scalar mapping to probability distribution mapping. Then we design a compact attenuation coefficient estimator initialized with values from a rough reconstruction and fast segmentation. Finally, our algorithm finishes the CT reconstruction by jointly optimizing the estimator and the generated distribution. Through experiments, we find that our method not only outperforms the comparative methods in sparse CT reconstruction but also can automatically generate semantic segmentation maps.

Abstract:
Instance segmentation of remote sensing images (RSIs) is an essential task for a wide range of applications such as land planning and intelligent transport. Instance segmentation of RSIs is constantly plagued by the unbalanced ratio of foreground to background and by limited instance size. Moreover, most instance segmentation models are based on deep feature learning and contain operations such as multiple downsampling steps, which are harmful to instance segmentation of RSIs, so performance remains limited. Inspired by the recent superior performance of prompt learning in visual tasks, we propose a new prompt paradigm to address the above issues. Building on an existing instance segmentation model, we first design a local prompt module to mine local prompt information from the original local tokens for specific instances. Second, we design a global‑to‑local prompt module to model the contextual information from the global tokens to the local tokens where the specific instances are located. Finally, a proposal area loss function is designed to add a decoupling dimension for proposals along the scale, to better exploit the potential of the above two prompt modules. It is worth mentioning that our proposed approach can extend the instance segmentation model to a promptable instance segmentation model, i.e., one that segments instances given specific box prompts. The time consumption for each promptable instance segmentation process is only 40 ms. The paper evaluates the effectiveness of our proposed approach on several existing models across four instance segmentation datasets of RSIs, and thorough experiments show that our proposed approach is effective in addressing the above issues and is a competitive model for instance segmentation of RSIs.

Abstract:
3D perception in LiDAR point clouds is crucial for a self‑driving vehicle to act properly in a 3D environment. However, manually labeling point clouds is hard and costly. There has been a growing interest in self‑supervised pre‑training of 3D perception models. Following the success of contrastive learning in images, current methods mostly conduct contrastive pre‑training on point clouds only. Yet an autonomous driving vehicle is typically supplied with multiple sensors including cameras and LiDAR. In this context, we systematically study single modality, cross‑modality, and multi‑modality for contrastive learning of point clouds, and show that cross‑modality wins over other alternatives. In addition, considering the huge difference between the training sources in 2D images and 3D point clouds, it remains unclear how to design more effective contrastive units for LiDAR. We therefore propose the instance‑aware and similarity‑balanced contrastive units that are tailored for self‑driving point clouds. Extensive experiments reveal that our approach achieves remarkable performance gains over various point cloud models across the downstream perception tasks of LiDAR‑based 3D object detection and 3D semantic segmentation on four popular benchmarks: Waymo Open Dataset, nuScenes, SemanticKITTI and ONCE.

Abstract:
This research introduces an advanced method for diagnosing diseases in sweet orange leaves by utilising advanced artificial intelligence models like YOLOv8. Due to their significance as a vital agricultural product, sweet oranges encounter significant threats from a variety of diseases that harmfully affect both their yield and quality. Conventional methods for disease detection primarily depend on manual inspection, which is ineffective and frequently leads to errors, resulting in delayed treatment and increased financial losses. In response to this challenge, the research utilized YOLOv8, harnessing their proficiencies in detecting objects and analyzing images. YOLOv8 is recognized for its rapid and precise performance, while ViT is acknowledged for its detailed feature extraction abilities. During both the training and validation stages, YOLOv8 exhibited an accuracy of 80.4%, while ViT achieved an accuracy of 99.12%, showcasing their potential to transform disease detection in agriculture. The study comprehensively examined the practical challenges related to the implementation of AI technologies in agriculture, encompassing the computational demands and user accessibility, and offering viable solutions for broader usage. Moreover, it underscores the environmental considerations, particularly the potential for reduced pesticide usage, thereby promoting sustainable farming and environmental conservation. These findings provide encouraging insights into the application of AI in agriculture, suggesting a transition towards more effective, sustainable, and technologically advanced farming methods. This research not only highlights the efficacy of YOLOv8 within a specific agricultural domain but also lays the foundation for further studies that encompass a broader application in crop management and sustainable agricultural practices.

Abstract:
Semantic segmentation is a vital task in the field of remote sensing (RS). However, conventional convolutional neural network (CNN) and transformer‑based models face limitations in capturing long‑range dependencies or are often computationally intensive. Recently, an advanced state space model (SSM), namely Mamba, was introduced, offering linear computational complexity while effectively establishing long‑distance dependencies. Despite their advantages, Mamba‑based methods encounter challenges in preserving local semantic information. To cope with these challenges, this paper proposes a novel network called Pyramid Pooling Mamba (PPMamba), which integrates CNN and Mamba for RS semantic segmentation tasks. The core structure of PPMamba, the Pyramid Pooling‑State Space Model (PP‑SSM) block, combines a local auxiliary mechanism with an omnidirectional state space model (OSS) that selectively scans feature maps from eight directions, capturing comprehensive feature information. Additionally, the auxiliary mechanism includes pyramid‑shaped convolutional branches designed to extract features at multiple scales. Extensive experiments on two widely‑used datasets, ISPRS Vaihingen and LoveDA Urban, demonstrate that PPMamba achieves competitive performance compared to state‑of‑the‑art models.

Abstract:
In recent years, there has been a growing interest in Semantic Image Synthesis (SIS) through the use of Generative Adversarial Networks (GANs) and diffusion models. This field has seen innovations such as the implementation of specialized loss functions tailored for this task, diverging from the more general approaches in Image‑to‑Image (I2I) translation. While the concept of Semantic Video Synthesis (SVS), the generation of temporally coherent, realistic sequences of images from semantic maps, is newly formalized in this paper, some existing methods have already explored aspects of this field. Most of these approaches rely on generic loss functions designed for video‑to‑video translation or require additional data to achieve temporal coherence. In this paper, we introduce the SVS‑GAN, a framework specifically designed for SVS, featuring a custom architecture and loss functions. Our approach includes a triple‑pyramid generator that utilizes SPADE blocks. Additionally, we employ a U‑Net‑based network for the image discriminator, which performs semantic segmentation for the OASIS loss. Through this combination of tailored architecture and objective engineering, our framework aims to bridge the existing gap between SIS and SVS, outperforming current state‑of‑the‑art models on datasets like Cityscapes and KITTI‑360.

Abstract:
We introduce Segmentation by Factorization (F‑SEG), an unsupervised segmentation method for pathology that generates segmentation masks from pre‑trained deep learning models. F‑SEG allows the use of pre‑trained deep neural networks, including recently developed pathology foundation models, for semantic segmentation. It achieves this without requiring additional training or finetuning, by factorizing the spatial features extracted by the models into segmentation masks and their associated concept features. We create generic tissue phenotypes for H&E images by training clustering models for multiple numbers of clusters on features extracted from several deep learning models on The Cancer Genome Atlas Program (TCGA), and then show how the clusters can be used for factorizing corresponding segmentation masks using off‑the‑shelf deep learning models. Our results show that F‑SEG provides robust unsupervised segmentation capabilities for H&E pathology images, and that the segmentation quality is greatly improved by utilizing pathology foundation models. We discuss and propose methods for evaluating the performance of unsupervised segmentation in pathology.

Abstract:
The ICPR 2024 Competition on Safe Segmentation of Drive Scenes in Unstructured Traffic and Adverse Weather Conditions served as a rigorous platform to evaluate and benchmark state‑of‑the‑art semantic segmentation models under challenging conditions for autonomous driving. Over several months, participants were provided with the IDD‑AW dataset, consisting of 5000 high‑quality RGB‑NIR image pairs, each annotated at the pixel level and captured under adverse weather conditions such as rain, fog, low light, and snow. A key aspect of the competition was the use and improvement of the Safe mean Intersection over Union (Safe mIoU) metric, designed to penalize unsafe incorrect predictions that could be overlooked by traditional mIoU. This innovative metric emphasized the importance of safety in developing autonomous driving systems. The competition showed significant advancements in the field, with participants demonstrating models that excelled in semantic segmentation and prioritized safety and robustness in unstructured and adverse conditions. The results of the competition set new benchmarks in the domain, highlighting the critical role of safety in deploying autonomous vehicles in real‑world scenarios. The contributions from this competition are expected to drive further innovation in autonomous driving technology, addressing the critical challenges of operating in diverse and unpredictable environments.

Abstract:
Perceiving the surrounding environment is a fundamental task in autonomous driving. To obtain highly accurate perception results, modern autonomous driving systems typically employ multi‑modal sensors to collect comprehensive environmental data. Among these, the radar‑camera multi‑modal perception system is especially favored for its excellent sensing capabilities and cost‑effectiveness. However, the substantial modality differences between radar and camera sensors pose challenges in fusing information. To address this problem, this paper presents RCBEVDet, a radar‑camera fusion 3D object detection framework. Specifically, RCBEVDet is developed from an existing camera‑based 3D object detector, supplemented by a specially designed radar feature extractor, RadarBEVNet, and a Cross‑Attention Multi‑layer Fusion (CAMF) module. Firstly, RadarBEVNet encodes sparse radar points into a dense bird's‑eye‑view (BEV) feature using a dual‑stream radar backbone and a Radar Cross Section aware BEV encoder. Secondly, the CAMF module utilizes a deformable attention mechanism to align radar and camera BEV features and adopts channel and spatial fusion layers to fuse them. To further enhance RCBEVDet's capabilities, we introduce RCBEVDet++, which advances the CAMF through sparse fusion, supports query‑based multi‑view camera perception models, and adapts to a broader range of perception tasks. Extensive experiments on the nuScenes dataset show that our method integrates seamlessly with existing camera‑based 3D perception models and improves their performance across various perception tasks. Furthermore, our method achieves state‑of‑the‑art radar‑camera fusion results in 3D object detection, BEV semantic segmentation, and 3D multi‑object tracking tasks. Notably, with ViT‑L as the image backbone, RCBEVDet++ achieves 72.73 NDS and 67.34 mAP in 3D object detection without test‑time augmentation or model ensembling.

Abstract:
We tackle the challenging problem of source‑free unsupervised domain adaptation (SFUDA) for 3D semantic segmentation. It amounts to performing domain adaptation on an unlabeled target domain without any access to source data; the available information is a model trained to achieve good performance on the source domain. A common issue with existing SFUDA approaches is that performance degrades after some training time, which is a by‑product of an under‑constrained and ill‑posed problem. We discuss two strategies to alleviate this issue. First, we propose a sensible way to regularize the learning problem. Second, we introduce a novel criterion based on agreement with a reference model. It is used (1) to stop the training when appropriate and (2) as a validator to select hyperparameters without any knowledge of the target domain. Our contributions are easy to implement and readily applicable to all SFUDA methods, ensuring stable improvements over all baselines. We validate our findings on various 3D lidar settings, achieving state‑of‑the‑art performance. The project repository (with code) is: github.com/valeoai/TTYD.

Abstract:
In this research, we introduce a unified end‑to‑end Automated Defect Classification‑Detection‑Segmentation (ADCDS) framework for classifying, detecting, and segmenting multiple instances of semiconductor defects for advanced nodes. This framework consists of two modules: (a) a defect detection module, followed by (b) a defect segmentation module. The defect detection module employs Deformable DETR to aid in the classification and detection of nano‑scale defects, while the segmentation module utilizes BoxSnake. BoxSnake facilitates box‑supervised instance segmentation of nano‑scale defects, supported by the former module. This simplifies the process by eliminating the laborious requirement for ground‑truth pixel‑wise mask annotation by human experts, which is typically associated with training conventional segmentation models. We have evaluated the performance of our ADCDS framework using two distinct process datasets from real wafers, namely ADI and AEI, specifically focusing on line‑space patterns. We have demonstrated the applicability and significance of our proposed methodology, particularly in the nano‑scale segmentation and generation of binary defect masks, using the challenging ADI SEM dataset where ground‑truth pixel‑wise segmentation annotations were unavailable. Furthermore, we have presented a comparative analysis of our proposed framework against previous approaches to demonstrate its effectiveness. Our proposed framework achieved an overall mAP@IoU0.5 of 72.19 for detection and 78.86 for segmentation on the ADI dataset. Similarly, for the AEI dataset, these metrics were 90.38 for detection and 95.48 for segmentation. Thus, our proposed framework effectively fulfils the requirements of advanced defect analysis while addressing significant constraints.

Abstract:
Delineating and classifying individual cells in microscopy tissue images is inherently challenging yet remains essential for advancements in medical and neuroscientific research. In this work, we propose a new deep learning framework, CISCA, for automatic cell instance segmentation and classification in histological slices. At the core of CISCA is a network architecture featuring a lightweight U‑Net with three heads in the decoder. The first head classifies pixels into boundaries between neighboring cells, cell bodies, and background, while the second head regresses four distance maps along four directions. The outputs from the first and second heads are integrated through a tailored post‑processing step, which ultimately produces the segmentation of individual cells. The third head enables the simultaneous classification of cells into relevant classes, if required. We demonstrate the effectiveness of our method using four datasets, including CoNIC, PanNuke, and MoNuSeg, which are publicly available H&E‑stained datasets that cover diverse tissue types and magnifications. In addition, we introduce CytoDArk0, the first annotated dataset of Nissl‑stained histological images of the mammalian brain, containing nearly 40k annotated neurons and glia cells, aimed at facilitating advancements in digital neuropathology and brain cytoarchitecture studies. We evaluate CISCA against other state‑of‑the‑art methods, demonstrating its versatility, robustness, and accuracy in segmenting and classifying cells across diverse tissue types, magnifications, and staining techniques. This makes CISCA well‑suited for detailed analyses of cell morphology and efficient cell counting in both digital pathology workflows and brain cytoarchitecture research.

Abstract:
To date, several methods have been developed to explain deep learning algorithms for classification tasks. Recently, an adaptation of two such methods has been proposed to generate instance‑level explainable maps in a semantic segmentation scenario, such as multiple sclerosis (MS) lesion segmentation. In the mentioned work, a 3D U‑Net was trained and tested for MS lesion segmentation, yielding an F1 score of 0.7006, and a positive predictive value (PPV) of 0.6265. The distribution of values in explainable maps exposed some differences between maps of true and false positive (TP/FP) examples. Inspired by those results, we explore in this paper the use of characteristics of lesion‑specific saliency maps to refine segmentation and detection scores. We generate around 21000 maps from as many TP/FP lesions in a batch of 72 patients (training set) and 4868 from the 37 patients in the test set. 93 radiomic features extracted from the first set of maps were used to train a logistic regression model and classify TP versus FP. On the test set, F1 score and PPV were improved by a large margin when compared to the initial model, reaching 0.7450 and 0.7817, with 95% confidence intervals of [0.7358, 0.7547] and [0.7679, 0.7962], respectively. These results suggest that saliency maps can be used to refine prediction scores, boosting a model's performance.
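As a reminder of how the two reported metrics respond when a TP/FP classifier filters candidate lesions, here is a minimal sketch using the standard definitions PPV = TP / (TP + FP) and F1 = 2TP / (2TP + FP + FN). The lesion counts are invented for illustration and are not taken from the paper; the function names `ppv` and `f1` are likewise hypothetical.

```python
def ppv(tp: int, fp: int) -> float:
    """Positive predictive value (precision): TP / (TP + FP)."""
    return tp / (tp + fp)


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 score: harmonic mean of precision and recall."""
    return 2 * tp / (2 * tp + fp + fn)


# Hypothetical lesion counts before filtering (NOT from the paper).
tp_before, fp_before, fn_before = 80, 40, 20

# Assume a TP/FP classifier rejects 30 false positives but also 5 true
# positives; rejected true positives become false negatives.
rejected_fp, rejected_tp = 30, 5
tp_after = tp_before - rejected_tp
fp_after = fp_before - rejected_fp
fn_after = fn_before + rejected_tp

ppv_before = ppv(tp_before, fp_before)
ppv_after = ppv(tp_after, fp_after)
f1_before = f1(tp_before, fp_before, fn_before)
f1_after = f1(tp_after, fp_after, fn_after)

print(f"PPV: {ppv_before:.4f} -> {ppv_after:.4f}")
print(f"F1:  {f1_before:.4f} -> {f1_after:.4f}")
```

With these made-up counts, discarding many false positives at the cost of a few true positives raises both PPV and F1, which is the kind of trade-off a post-hoc TP/FP classifier exploits.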

Abstract:
Foundation models (FMs) are a popular topic of research in AI. Their ability to generalize to new tasks and datasets without retraining or needing an abundance of data makes them an appealing candidate for applications on specialist datasets. In this work, we compare the performance of FMs to finetuned pre‑trained supervised models in the task of semantic segmentation on an entirely new dataset. We see that finetuned models consistently outperform the FMs tested, even in cases where data is scarce. We release the code and dataset for this work on GitHub.

Abstract:
Research in efficient vision backbones is evolving into models that are a mixture of convolutions and transformer blocks. A smart combination of both, architecture‑wise and component‑wise, is mandatory to excel in the speed‑accuracy trade‑off. Most publications focus on maximizing accuracy and utilize MACs (multiply accumulate operations) as an efficiency metric. The latter, however, often does not accurately measure how fast a model actually is due to factors like memory access cost and degree of parallelism. We analyzed common modules and architectural design choices for backbones not in terms of MACs, but rather in actual throughput and latency, as the combination of the latter two is a better representation of the efficiency of models in real applications. We applied the conclusions taken from that analysis to create a recipe for increasing hardware‑efficiency in macro design. Additionally, we introduce a simple slimmed‑down version of Multi‑Head Self‑Attention that aligns with our analysis. We combine both macro and micro design to create a new family of hardware‑efficient backbone networks called LowFormer. LowFormer achieves a remarkable speedup in terms of throughput and latency, while achieving similar or better accuracy than current state‑of‑the‑art efficient backbones. In order to prove the generalizability of our hardware‑efficient design, we evaluate our method on GPU, mobile GPU and ARM CPU. We further show that the downstream tasks object detection and semantic segmentation profit from our hardware‑efficient architecture. Code and models are available at https://github.com/altair199797/LowFormer.

Abstract:
We introduce a novel method for updating 3D geospatial models, specifically targeting occlusion removal in large‑scale maritime environments. Traditional 3D reconstruction techniques often face problems with dynamic objects, like cars or vessels, that obscure the true environment, leading to inaccurate models or requiring extensive manual editing. Our approach leverages deep learning techniques, including instance segmentation and generative inpainting, to directly modify both the texture and geometry of 3D meshes without the need for costly reprocessing. By selectively targeting occluding objects and preserving static elements, the method enhances both geometric and visual accuracy. This approach not only preserves structural and textural details of map data but also maintains compatibility with current geospatial standards, ensuring robust performance across diverse datasets. The results demonstrate significant improvements in 3D model fidelity, making this method highly applicable for maritime situational awareness and the dynamic display of auxiliary information.

Abstract:
Unmanned Aerial Vehicles (UAVs) have revolutionized the process of gathering and analyzing data in diverse research domains, providing unmatched adaptability and effectiveness. This paper presents a thorough examination of UAV datasets, emphasizing their wide range of applications and progress. UAV datasets consist of various types of data, such as satellite imagery, images captured by drones, and videos. These datasets can be categorized as either unimodal or multimodal, offering a wide range of detailed and comprehensive information. These datasets play a crucial role in disaster damage assessment, aerial surveillance, object recognition, and tracking. They facilitate the development of sophisticated models for tasks like semantic segmentation, pose estimation, vehicle re‑identification, and gesture recognition. By leveraging UAV datasets, researchers can significantly enhance the capabilities of computer vision models, thereby advancing technology and improving our understanding of complex, dynamic environments from an aerial perspective. This review aims to encapsulate the multifaceted utility of UAV datasets, emphasizing their pivotal role in driving innovation and practical applications in multiple domains.

Abstract:
Transfer learning based on full fine‑tuning (FFT) of the pre‑trained encoder and task‑specific decoder becomes increasingly complex as deep models grow exponentially. Parameter efficient fine‑tuning (PEFT) approaches using adapters consisting of small learnable layers have emerged as an alternative to FFT, achieving comparable performance while maintaining high training efficiency. However, the inflexibility of the adapter with respect to input instances limits its capability of learning task‑specific information in diverse downstream tasks. In this paper, we propose a novel PEFT approach, input‑Conditioned transFormer, termed iConFormer, that leverages a dynamic adapter conditioned on the input instances. To secure flexible learning ability on input instances in various downstream tasks, we introduce an input‑Conditioned Network (iCoN) in the dynamic adapter that enables instance‑level feature transformation. To be specific, iCoN generates channel‑wise convolutional kernels for each feature and transforms it using an adaptive convolution process to effectively capture task‑specific and fine‑grained details tailored to downstream tasks. Experimental results demonstrate that by tuning just 1.6% to 2.8% of the Transformer backbone parameters, iConFormer achieves performance comparable to FFT in monocular depth estimation and semantic segmentation, while outperforming it in image classification and instance segmentation. Also, the proposed method consistently outperforms recent PEFT methods for all the tasks mentioned above.

Abstract:
Unsupervised Domain Adaptation (UDA) endeavors to bridge the gap between a model trained on a labeled source domain and its deployment in an unlabeled target domain. However, current high‑performance models demand significant resources, making deployment costs prohibitive and highlighting the need for compact, yet effective models. For UDA of lightweight models, Knowledge Distillation (KD) leveraging a Teacher‑Student framework could be a common approach, but we found that domain shift in UDA leads to a significant increase in non‑salient parameters in the teacher model, degrading the model's generalization ability and transferring misleading information to the student model. Interestingly, we observed that this phenomenon occurs considerably less in the student model. Driven by this insight, we introduce Collaborative Learning for UDA (CLDA), a method that updates the teacher's non‑salient parameters using the student model and at the same time utilizes the updated teacher model to improve UDA performance of the student model. Experiments show consistent performance improvements for both student and teacher models. For example, in semantic segmentation, CLDA achieves an improvement of +0.7% mIoU for the teacher model and +1.4% mIoU for the student model compared to the baseline model on the GTA‑to‑Cityscapes benchmark. On the Synthia‑to‑Cityscapes benchmark, it achieves an improvement of +0.8% mIoU and +2.0% mIoU for the teacher and student models, respectively.

Abstract:
Segment Anything Model (SAM) has demonstrated powerful zero‑shot segmentation performance in natural scenes. The recently released Segment Anything Model 2 (SAM2) has further heightened researchers' expectations towards image segmentation capabilities. To evaluate the performance of SAM2 on class‑agnostic instance‑level segmentation tasks, we adopt different prompt strategies for SAM2 to cope with instance‑level tasks for three relevant scenarios: Salient Instance Segmentation (SIS), Camouflaged Instance Segmentation (CIS), and Shadow Instance Detection (SID). In addition, to further explore the effectiveness of SAM2 in segmenting granular object structures, we also conduct detailed tests on the high‑resolution Dichotomous Image Segmentation (DIS) benchmark to assess the fine‑grained segmentation capability. Qualitative and quantitative experimental results indicate that the performance of SAM2 varies significantly across different scenarios. Besides, SAM2 is not particularly sensitive to segmenting high‑resolution fine details. We hope this technical report can drive the emergence of SAM2‑based adapters, aiming to enhance the performance ceiling of large vision models on class‑agnostic instance segmentation tasks.

Abstract:
Masked Image Modeling (MIM) techniques have redefined the landscape of computer vision, enabling pre‑trained models to achieve exceptional performance across a broad spectrum of tasks. Despite their success, the full potential of MIM‑based methods in dense prediction tasks, particularly in depth estimation, remains untapped. Existing MIM approaches primarily rely on single‑image inputs, which makes it challenging to capture the crucial structured information, leading to suboptimal performance in tasks requiring fine‑grained feature representation. To address these limitations, we propose SG‑MIM, a novel Structured knowledge Guided Masked Image Modeling framework designed to enhance dense prediction tasks by utilizing structured knowledge alongside images. SG‑MIM employs a lightweight relational guidance framework, allowing it to guide structured knowledge individually at the feature level rather than naively combining at the pixel level within the same architecture, as is common in traditional multi‑modal pre‑training methods. This approach enables the model to efficiently capture essential information while minimizing discrepancies between pre‑training and downstream tasks. Furthermore, SG‑MIM employs a selective masking strategy to incorporate structured knowledge, maximizing the synergy between general representation learning and structured knowledge‑specific learning. Our method requires no additional annotations, making it a versatile and efficient solution for a wide range of applications. Our evaluations on the KITTI, NYU‑v2, and ADE20k datasets demonstrate SG‑MIM's superiority in monocular depth estimation and semantic segmentation.

Abstract:
K‑Origins is a neural network layer designed to improve the performance of image‑based networks when learning colour, or intensities, is beneficial. Over 250 encoder‑decoder convolutional networks are trained and tested on 16‑bit synthetic data, demonstrating that K‑Origins improves semantic segmentation accuracy in two scenarios: object detection with low signal‑to‑noise ratios, and segmenting multiple objects that are identical in shape but vary in colour. K‑Origins generates output features from the input features, X, by the equation Y_k = X - J \cdot w_k for each trainable parameter w_k, where J is a matrix of ones. Additionally, networks with varying receptive fields were trained to determine optimal network depths based on the dimensions of target classes, suggesting that receptive field lengths should exceed object sizes. By ensuring a sufficient receptive field length and incorporating K‑Origins, we can achieve better semantic network performance.
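The K‑Origins equation can be illustrated with a short sketch. It assumes a single‑channel 2D input and treats each w_k as a fixed float rather than a trained parameter; the function name `k_origins` and the sample values are hypothetical, and the abstract does not specify how the outputs are consumed downstream.

```python
from typing import List

Matrix = List[List[float]]


def k_origins(x: Matrix, weights: List[float]) -> List[Matrix]:
    """Return one shifted copy of x per weight, i.e. Y_k = X - J * w_k.

    J is a matrix of ones with the same shape as x, so subtracting
    J * w_k simply subtracts the scalar w_k from every element of x.
    """
    return [[[value - w for value in row] for row in x] for w in weights]


# Hypothetical 2x2 intensity image and two "origins" (assumed values).
x = [[0.0, 100.0],
     [200.0, 300.0]]
weights = [100.0, 250.0]

y = k_origins(x, weights)
for w, y_k in zip(weights, y):
    print(f"w_k = {w}: {y_k}")
```

Each output Y_k re‑centres the intensities around w_k, so the sign of an element indicates whether the input intensity lies above or below that origin.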

Abstract:
Food computing is both important and challenging in computer vision (CV). It significantly contributes to the development of CV algorithms due to its frequent presence in datasets across various applications, ranging from classification and instance segmentation to 3D reconstruction. The polymorphic shapes and textures of food, coupled with high variation in forms and vast multimodal information, including language descriptions and nutritional data, make food computing a complex and demanding task for modern CV algorithms. 3D food modeling is a new frontier for addressing food related problems, due to its inherent capability to deal with random camera views and its straightforward representation for calculating food portion size. However, the primary hurdle in the development of algorithms for food object analysis is the lack of nutrition values in existing 3D datasets. Moreover, in the broader field of 3D research, there is a critical need for domain‑specific test datasets. To bridge the gap between general 3D vision and food computing research, we introduce MetaFood3D. This dataset consists of 743 meticulously scanned and labeled 3D food objects across 131 categories, featuring detailed nutrition information, weight, and food codes linked to a comprehensive nutrition database. Our MetaFood3D dataset emphasizes intra‑class diversity and includes rich modalities such as textured mesh files, RGB‑D videos, and segmentation masks. Experimental results demonstrate our dataset's strong capabilities in enhancing food portion estimation algorithms, highlight the gap between video captures and 3D scanned data, and showcase the strengths of MetaFood3D in generating synthetic eating occasion data and 3D food objects.

Abstract:
Expanding the receptive field in a deep learning model for large‑scale 3D point cloud segmentation is an effective technique for capturing rich contextual information, which consequently enhances the network's ability to learn meaningful features. However, this often leads to increased computational complexity and risk of overfitting, challenging the efficiency and effectiveness of the learning paradigm. To address these limitations, we propose the Local Split Attention Pooling (LSAP) mechanism to effectively expand the receptive field through a series of local split operations, thus facilitating the acquisition of broader contextual knowledge. Concurrently, it optimizes the computational workload associated with attention‑pooling layers to ensure a more streamlined processing workflow. Based on LSAP, a Parallel Aggregation Enhancement (PAE) module is introduced to enable parallel processing of data using both 2D and 3D neighboring information to further enhance contextual representations within the network. In light of the aforementioned designs, we put forth a novel framework, designated as LSNet, for large‑scale point cloud semantic segmentation. Extensive evaluations demonstrated the efficacy of seamlessly integrating the proposed PAE module into existing frameworks, yielding significant improvements in mean intersection over union (mIoU) metrics, with a notable increase of up to 11%. Furthermore, LSNet demonstrated superior performance compared to state‑of‑the‑art semantic segmentation networks on three benchmark datasets, including S3DIS, Toronto3D, and SensatUrban. It is noteworthy that our method achieved a substantial speedup of approximately 38.8% compared to those employing similar‑sized receptive fields, which serves to highlight both its computational efficiency and practical utility in real‑world large‑scale scenes.
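For readers unfamiliar with the metric, the following is a minimal sketch of mean intersection over union (mIoU) computed over flat lists of per‑point class labels. It is a generic illustration, not the evaluation code of LSNet; the function name `mean_iou` and the toy labels are assumptions, and classes absent from both prediction and ground truth are skipped, which is one common convention among several.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Average the per-class IoU over classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        # Points where both prediction and ground truth are class c.
        intersection = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Points where either prediction or ground truth is class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes that never occur
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy labels for six points and three classes (assumed values).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]

miou = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {miou:.4f}")
```

In this toy case the per‑class IoUs are 1/3, 2/3 and 1/2, so the mean is 0.5.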

Abstract:
In this paper, we introduce a hierarchical transformer‑based model designed for sophisticated image segmentation tasks, effectively bridging the granularity of part segmentation with the comprehensive scope of object segmentation. At the heart of our approach is a multi‑level representation strategy, which systematically advances from individual pixels to superpixels, and ultimately to cohesive group formations. This architecture is underpinned by two pivotal aggregation strategies: local aggregation and global aggregation. Local aggregation is employed to form superpixels, leveraging the inherent redundancy of the image data to produce segments closely aligned with specific parts of the object, guided by object‑level supervision. In contrast, global aggregation interlinks these superpixels, organizing them into larger groups that correlate with entire objects and benefit from part‑level supervision. This dual aggregation framework ensures a versatile adaptation to varying supervision inputs while maintaining computational efficiency. Our methodology notably improves the balance between adaptability across different supervision modalities and computational manageability, culminating in significant enhancement in segmentation performance. When tested on the PartImageNet dataset, our model achieves a substantial increase, outperforming the previous state‑of‑the‑art by 2.8% and 0.8% in mIoU scores for part and object segmentation, respectively. Similarly, on the Pascal Part dataset, it records performance enhancements of 1.5% and 2.0% for part and object segmentation, respectively.

Abstract:
Test‑time domain adaptation (TTDA) for semantic segmentation aims to adapt a segmentation model trained on a source domain to a target domain for inference on‑the‑fly, where both efficiency and effectiveness are critical. However, existing TTDA methods either rely on costly frame‑wise optimization or assume unrealistic domain shifts, resulting in poor adaptation efficiency and continuous semantic ambiguities. To address these challenges, we propose a real‑time framework for TTDA semantic segmentation, called Dynamic Ambiguity‑Wise Adaptation (DAWA), which adaptively detects domain shifts and dynamically adjusts the learning strategies to mitigate continuous ambiguities at test time. Specifically, we introduce the Dynamic Ambiguous Patch Mask (DAP Mask) strategy, which dynamically identifies and masks highly disturbed regions to prevent error accumulation in ambiguous classes. Furthermore, we present the Dynamic Ambiguous Class Mix (DAC Mix) strategy that leverages vision‑language models to group semantically similar classes and augment the target domain with a meta‑ambiguous class buffer. Extensive experiments on widely used TTDA benchmarks demonstrate that DAWA consistently outperforms state‑of‑the‑art methods, while maintaining real‑time inference speeds of approximately 40 FPS.

Abstract:
We explore Bird's‑Eye View (BEV) generation, that is, converting a BEV map into its corresponding multi‑view street images. Valued for its unified spatial representation aiding multi‑sensor fusion, BEV is pivotal for various autonomous driving applications. Creating accurate street‑view images from BEV maps is essential for portraying complex traffic scenarios and enhancing driving algorithms. Concurrently, diffusion‑based conditional image generation models have demonstrated remarkable outcomes, adept at producing diverse, high‑quality, and condition‑aligned results. Nonetheless, the training of these models demands substantial data and computational resources. Hence, exploring methods to fine‑tune these advanced models, like Stable Diffusion, for specific conditional generation tasks emerges as a promising avenue. In this paper, we introduce a practical framework for generating images from a BEV layout. Our approach comprises two main components: the Neural View Transformation and the Street Image Generation. The Neural View Transformation phase converts the BEV map into aligned multi‑view semantic segmentation maps by learning the shape correspondence between the BEV and perspective views. Subsequently, the Street Image Generation phase utilizes these segmentations as a condition to guide a fine‑tuned latent diffusion model. This fine‑tuning process ensures both view and style consistency. Our model leverages the generative capacity of large pretrained diffusion models within traffic contexts, effectively yielding diverse and condition‑coherent street‑view images.

Abstract:
Implicit Neural Representations (INRs) have recently advanced the field of deep learning due to their ability to learn continuous representations of signals without the need for large training datasets. Although INR methods have been studied for medical image super‑resolution, their adaptability to localized priors in medical images has not been extensively explored. Medical images contain rich anatomical divisions that could provide valuable local prior information to enhance the accuracy and robustness of INRs. In this work, we propose a novel framework, referred to as the Semantically Conditioned INR (SeCo‑INR), that conditions an INR using local priors from a medical image, enabling accurate model fitting and interpolation capabilities to achieve super‑resolution. Our framework learns a continuous representation of the semantic segmentation features of a medical image and utilizes it to derive the optimal INR for each semantic region of the image. We tested our framework using several medical imaging modalities and achieved higher quantitative scores and more realistic super‑resolution outputs compared to state‑of‑the‑art methods.

Abstract:
Infrared and visible dual‑modality tasks such as semantic segmentation and object detection can achieve robust performance even in extreme scenes by fusing complementary information. Most current methods design task‑specific frameworks, which generalize poorly across tasks. In this paper, we propose a fusion‑guided infrared and visible general framework, IVGF, which can be easily extended to many high‑level vision tasks. First, we adopt state‑of‑the‑art infrared and visible foundation models to extract general representations. Then, to enrich the semantic information of these general representations for high‑level vision tasks, we design a feature enhancement module and a token enhancement module for feature maps and tokens, respectively. In addition, an attention‑guided fusion module is proposed to fuse the two modalities effectively by exploring their complementary information. We also adopt the cutout&mix augmentation strategy, which further improves the model's ability to mine regional complementarity between the two modalities. Extensive experiments show that IVGF outperforms state‑of‑the‑art dual‑modality methods on semantic segmentation and object detection tasks. Detailed ablation studies demonstrate the effectiveness of each module, and a further experiment explores the robustness of the proposed method to a missing modality in the dual‑modality semantic segmentation task.

Abstract:
Pre‑trained on extensive and diverse multi‑modal datasets, 2D foundation models excel at addressing 2D tasks with little or no downstream supervision, owing to their robust representations. The emergence of 2D‑to‑3D distillation frameworks has extended these capabilities to 3D models. However, distilling 3D representations for autonomous driving datasets presents challenges like self‑similarity, class imbalance, and point cloud sparsity, hindering the effectiveness of contrastive distillation, especially in zero‑shot learning contexts. Whereas other methodologies, such as similarity‑based distillation, enhance zero‑shot performance, they tend to yield less discriminative representations, diminishing few‑shot performance. We investigate the gap in structure between the 2D and the 3D representations that result from state‑of‑the‑art distillation frameworks and reveal a significant mismatch between the two. Additionally, we demonstrate that the observed structural gap is negatively correlated with the efficacy of the distilled representations on zero‑shot and few‑shot 3D semantic segmentation. To bridge this gap, we propose a relational distillation framework enforcing intra‑modal and cross‑modal constraints, resulting in distilled 3D representations that closely capture the structure of the 2D representation. This alignment significantly enhances 3D representation performance over those learned through contrastive distillation in zero‑shot segmentation tasks. Furthermore, our relational loss consistently improves the quality of 3D representations in both in‑distribution and out‑of‑distribution few‑shot segmentation tasks, outperforming approaches that rely on the similarity loss.

Abstract:
Despite the eye‑catching breakthroughs achieved by deep visual networks in detecting region‑level surface defects, high‑quality pixel‑wise defect detection remains challenging due to diverse defect appearances and data scarcity. To avoid over‑reliance on defect appearance and achieve accurate defect segmentation, we propose a change‑aware Siamese network that solves defect segmentation in a change detection framework. A novel multi‑class balanced contrastive loss is introduced to guide the Transformer‑based encoder, which enables encoding diverse categories of defects as the unified class‑agnostic difference between defect and defect‑free images. The difference, represented as a distance map, is then skip‑connected to the change‑aware decoder to assist in locating both inter‑class and out‑of‑class pixel‑wise defects. In addition, we propose a synthetic dataset with multi‑class liquid crystal display (LCD) defects under a complex and disjointed background context, to demonstrate the advantages of change‑based modeling over appearance‑based modeling for defect segmentation. On our proposed dataset and two public datasets, our model outperforms the leading semantic segmentation methods while maintaining a relatively small model size. Moreover, our model achieves new state‑of‑the‑art performance compared to semi‑supervised approaches in various supervision settings.
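The distance map mentioned in this abstract can be pictured as a per‑pixel distance between the features of a test image and those of a defect‑free reference. The sketch below is only an illustration under assumptions: it uses Euclidean distance on plain nested lists, the helper names `distance_map` and `change_mask` are hypothetical, and the fixed threshold stands in for the learned change‑aware decoder, none of which is specified by the abstract.

```python
import math


def distance_map(feat_a, feat_b):
    """Per-pixel Euclidean distance between two H x W x C feature maps."""
    return [
        [
            math.sqrt(sum((a - b) ** 2 for a, b in zip(pix_a, pix_b)))
            for pix_a, pix_b in zip(row_a, row_b)
        ]
        for row_a, row_b in zip(feat_a, feat_b)
    ]


def change_mask(dist, threshold):
    """Binary mask of pixels whose feature distance exceeds the threshold."""
    return [[1 if d > threshold else 0 for d in row] for row in dist]


# Toy 2 x 2 feature maps with 2 channels; only the bottom-right pixel differs.
reference = [[[1.0, 0.0], [1.0, 0.0]],
             [[1.0, 0.0], [1.0, 0.0]]]
test_image = [[[1.0, 0.0], [1.0, 0.0]],
              [[1.0, 0.0], [4.0, 4.0]]]

dist = distance_map(reference, test_image)   # dist[1][1] is 5.0, the rest 0.0
mask = change_mask(dist, threshold=1.0)      # threshold value is an assumption
```

Because the map depends only on how far the two feature vectors are apart, it does not need to know which defect category caused the change, which matches the class‑agnostic framing in the abstract.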

Abstract:
In this study, we tackle the challenge of identifying plant species from ultra high resolution (UHR) remote sensing images. We introduce an RGB remote sensing dataset with millimeter‑level spatial resolution, curated through several field expeditions across a mountainous region in France covering various landscapes. Plant species identification is framed as a semantic segmentation problem so that it can be applied practically and efficiently across vast geographical areas. However, in the segmentation masks, the boundaries between plant species and their background are often difficult to distinguish. We tackle this issue by introducing a fuzzy loss within the segmentation model. Instead of utilizing one‑hot encoded ground truth (GT), our model incorporates Gaussian‑filter‑refined GT, introducing stochasticity during training. Initial experimental results on both our UHR dataset and a public dataset are presented, showing the relevance of the proposed methodology as well as the need for future improvement.
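One way to picture Gaussian‑filter‑refined ground truth is to blur each one‑hot class plane so that labels become soft near boundaries, then train with a cross‑entropy against those soft targets. The sketch below is an illustration under assumptions rather than the authors' implementation: the kernel width (`sigma`, `radius`), the edge‑replication padding, the soft cross‑entropy form, and all function names are hypothetical choices not given in the abstract.

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalized 1D Gaussian weights of length 2 * radius + 1."""
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(grid, kernel):
    """Convolve every row with the kernel, replicating edge values."""
    radius = len(kernel) // 2
    width = len(grid[0])
    return [
        [sum(kernel[k + radius] * row[min(max(x + k, 0), width - 1)]
             for k in range(-radius, radius + 1))
         for x in range(width)]
        for row in grid
    ]


def transpose(grid):
    return [list(col) for col in zip(*grid)]


def blur2d(grid, kernel):
    """Separable 2D Gaussian blur: rows first, then columns."""
    return transpose(blur_rows(transpose(blur_rows(grid, kernel)), kernel))


def soft_labels(label_map, num_classes, sigma=1.0, radius=2):
    """Blur each one-hot class plane; the result is indexed [class][y][x].

    The blur is linear and the kernel is normalized, so the class weights
    at every pixel still sum to 1.
    """
    kernel = gaussian_kernel(sigma, radius)
    return [
        blur2d([[1.0 if v == c else 0.0 for v in row] for row in label_map],
               kernel)
        for c in range(num_classes)
    ]


def soft_cross_entropy(probs, targets, eps=1e-12):
    """Mean over pixels of -sum_c target_c * log(prob_c)."""
    height, width = len(targets[0]), len(targets[0][0])
    total = 0.0
    for y in range(height):
        for x in range(width):
            total -= sum(t[y][x] * math.log(p[y][x] + eps)
                         for p, t in zip(probs, targets))
    return total / (height * width)


# Toy 4 x 4 label map with a vertical boundary between class 0 and class 1.
label_map = [[0, 0, 1, 1] for _ in range(4)]
targets = soft_labels(label_map, num_classes=2)
uniform = [[[0.5] * 4 for _ in range(4)] for _ in range(2)]
loss = soft_cross_entropy(uniform, targets)
```

In this toy case, pixels far from the boundary keep a target close to 1 for their class, while pixels next to it receive a mixed target, so confident predictions at uncertain boundaries are penalized less harshly than with one‑hot labels.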